[2025-11-13 08:04:09,151][mllm.models.large_language_model_local][INFO] - Initializing adapter 'agent_adapter': no initial weights provided or found; starting from scratch.
[2025-11-13 08:04:09,962][mllm.models.adapter_training_wrapper][INFO] - Adapter 'agent_adapter': initialized with fresh weights (no initial weights found).
[2025-11-13 08:04:09,969][mllm.models.large_language_model_local][INFO] - Initializing adapter 'critic_adapter': no initial weights provided or found; starting from scratch.
[2025-11-13 08:04:11,033][mllm.models.adapter_training_wrapper][INFO] - Adapter 'critic_adapter': initialized with fresh weights (no initial weights found).
[2025-11-13 08:06:20,550][__main__][INFO] - Starting iteration 0.
[2025-11-13 08:06:20,554][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1.
[2025-11-13 08:06:20,554][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:06:25,283][__main__][INFO] - Number of regex retries in iteration 0: 0
[2025-11-13 08:06:25,284][__main__][INFO] - agents played in iteration 0 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:06:25,694][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 37.45%, Block Peak % of device VRAM: 18.68%, ΔTime: 00:00:00
[2025-11-13 08:06:25,736][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 37.45%, Block Peak % of device VRAM: 18.68%, ΔTime: 00:00:00
[2025-11-13 08:06:25,777][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 37.45%, Block Peak % of device VRAM: 18.68%, ΔTime: 00:00:00
[2025-11-13 08:06:25,818][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 37.45%, Block Peak % of device VRAM: 18.68%, ΔTime: 00:00:00
[2025-11-13 08:06:25,818][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:06:25,819][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:06:26,442][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:06:27,044][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:06:27,375][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:06:27,702][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:06:28,035][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:06:28,367][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:06:28,696][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:06:29,029][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:06:29,356][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:06:29,684][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:06:30,012][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:06:30,338][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:06:30,664][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:06:30,989][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:06:31,315][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:06:31,643][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:06:31,967][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:06:32,294][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:06:32,619][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:06:32,944][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:06:33,269][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:06:33,593][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:06:33,925][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:06:34,253][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:06:34,585][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:06:34,918][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:06:35,242][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:06:35,567][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:06:35,892][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:06:36,224][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:06:36,552][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:06:36,877][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:06:37,207][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:06:37,922][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 4.58%, Current % of VRAM taken: 42.03%, Block Peak % of device VRAM: 25.21%, ΔTime: 00:00:11
[2025-11-13 08:06:38,567][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:06:38,568][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:06:38,570][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:06:39,790][__main__][INFO] - Iteration 1 took 19s (24.58% Gen, 69.06% Train). Generation: 4s, Training: 13s. Estimated remaining time: 15h 59m 2s. Estimated total time: 16h 1m 52s. Time estimates for 10 more iterations: 3m 12s, 100 more iterations: 32m 3s, 500 more iterations: 2h 40m 18s.
[2025-11-13 08:06:39,793][__main__][INFO] - Starting iteration 1.
[2025-11-13 08:06:39,795][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1.
[2025-11-13 08:06:39,796][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:06:43,397][__main__][INFO] - Number of regex retries in iteration 1: 0
[2025-11-13 08:06:43,398][__main__][INFO] - agents played in iteration 1 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:06:43,819][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:06:43,860][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:06:43,899][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:06:43,940][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:06:43,941][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:06:43,941][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:06:44,629][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:06:44,925][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:06:45,251][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:06:45,581][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:06:45,907][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:06:46,240][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:06:46,573][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:06:46,906][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:06:47,232][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:06:47,559][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:06:47,891][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:06:48,216][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:06:48,544][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:06:48,871][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:06:49,197][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:06:49,526][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:06:49,854][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:06:50,184][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:06:50,514][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:06:50,842][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:06:51,172][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:06:51,507][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:06:51,835][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:06:52,161][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:06:52,488][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:06:52,816][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:06:53,142][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:06:53,468][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:06:53,795][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:06:54,121][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:06:54,447][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:06:54,775][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:06:55,110][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:06:55,780][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:06:56,490][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:06:56,492][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:06:56,493][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:06:57,476][__main__][INFO] - Iteration 2 took 17s (20.37% Gen, 74.06% Train). Generation: 3s, Training: 13s. Estimated remaining time: 14h 40m 56s. Estimated total time: 14h 44m 4s. Time estimates for 10 more iterations: 2m 56s, 100 more iterations: 29m 28s, 500 more iterations: 2h 27m 20s.
[2025-11-13 08:06:57,478][__main__][INFO] - Starting iteration 2.
[2025-11-13 08:06:57,481][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1.
[2025-11-13 08:06:57,482][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:07:01,117][__main__][INFO] - Number of regex retries in iteration 2: 0
[2025-11-13 08:07:01,118][__main__][INFO] - agents played in iteration 2 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:07:01,537][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:07:01,579][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:07:01,619][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:07:01,659][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:07:01,659][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:07:01,659][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:07:02,352][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:07:02,649][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:07:02,982][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:07:03,311][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:07:03,640][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:07:03,969][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:07:04,296][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:07:04,623][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:07:04,950][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:07:05,280][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:07:05,609][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:07:05,939][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:07:06,266][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:07:06,596][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:07:06,921][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:07:07,249][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:07:07,577][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:07:07,903][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:07:08,229][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:07:08,558][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:07:08,883][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:07:09,212][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:07:09,547][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:07:09,873][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:07:10,199][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:07:10,525][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:07:10,853][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:07:11,181][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:07:11,511][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:07:11,841][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:07:12,170][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:07:12,497][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:07:12,824][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:07:13,493][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:07:14,245][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:07:14,247][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:07:14,248][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:07:15,297][__main__][INFO] - Iteration 3 took 17s (20.41% Gen, 73.69% Train). Generation: 3s, Training: 13s. Estimated remaining time: 14h 47m 25s. Estimated total time: 14h 50m 51s. Time estimates for 10 more iterations: 2m 58s, 100 more iterations: 29m 41s, 500 more iterations: 2h 28m 28s.
[2025-11-13 08:07:15,304][__main__][INFO] - Starting iteration 3.
[2025-11-13 08:07:15,308][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1.
[2025-11-13 08:07:15,309][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:07:18,927][__main__][INFO] - Number of regex retries in iteration 3: 0
[2025-11-13 08:07:18,928][__main__][INFO] - agents played in iteration 3 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:07:19,355][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:07:19,395][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:07:19,433][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:07:19,473][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:07:19,473][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:07:19,474][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:07:20,159][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:07:20,457][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:07:20,789][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:07:21,119][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:07:21,449][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:07:21,779][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:07:22,106][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:07:22,431][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:07:22,763][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:07:23,088][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:07:23,416][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:07:23,742][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:07:24,071][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:07:24,397][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:07:24,723][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:07:25,049][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:07:25,375][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:07:25,701][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:07:26,030][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:07:26,362][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:07:26,692][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:07:27,025][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:07:27,354][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:07:27,683][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:07:28,011][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:07:28,344][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:07:28,673][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:07:28,999][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:07:29,329][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:07:29,662][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:07:29,990][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:07:30,322][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:07:30,652][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:07:31,316][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:07:32,064][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:07:32,066][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:07:32,068][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:07:33,069][__main__][INFO] - Iteration 4 took 17s (20.37% Gen, 73.98% Train). Generation: 3s, Training: 13s. Estimated remaining time: 14h 44m 23s. Estimated total time: 14h 48m 6s. Time estimates for 10 more iterations: 2m 57s, 100 more iterations: 29m 36s, 500 more iterations: 2h 28m 1s.
[2025-11-13 08:07:33,071][__main__][INFO] - Starting iteration 4.
[2025-11-13 08:07:33,074][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1.
[2025-11-13 08:07:33,075][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:07:36,706][__main__][INFO] - Number of regex retries in iteration 4: 0
[2025-11-13 08:07:36,707][__main__][INFO] - agents played in iteration 4 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:07:37,157][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:07:37,196][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:07:37,235][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:07:37,274][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:07:37,274][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:07:37,275][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:07:37,979][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:07:38,276][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:07:38,604][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:07:38,937][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:07:39,263][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:07:39,594][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:07:39,922][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:07:40,248][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:07:40,575][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:07:40,904][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:07:41,234][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:07:41,564][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:07:41,899][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:07:42,234][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:07:42,568][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:07:42,897][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:07:43,224][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:07:43,550][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:07:43,876][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:07:44,205][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:07:44,535][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:07:44,869][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:07:45,198][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:07:45,527][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:07:45,854][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:07:46,180][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:07:46,507][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:07:46,833][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:07:47,159][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:07:47,487][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:07:47,814][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:07:48,141][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:07:48,469][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:07:49,141][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:07:49,878][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:07:49,879][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:07:49,881][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:07:50,871][__main__][INFO] - Iteration 5 took 17s (20.40% Gen, 74.02% Train). Generation: 3s, Training: 13s. Estimated remaining time: 14h 45m 52s. Estimated total time: 14h 49m 53s. Time estimates for 10 more iterations: 2m 57s, 100 more iterations: 29m 39s, 500 more iterations: 2h 28m 18s.
[2025-11-13 08:07:50,873][__main__][INFO] - Starting iteration 5.
[2025-11-13 08:07:50,877][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1.
[2025-11-13 08:07:50,878][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:07:54,514][__main__][INFO] - Number of regex retries in iteration 5: 0
[2025-11-13 08:07:54,515][__main__][INFO] - agents played in iteration 5 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:07:54,940][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:07:54,980][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:07:55,019][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:07:55,057][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:07:55,058][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:07:55,058][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:07:55,765][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:07:56,061][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:07:56,391][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:07:56,721][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:07:57,047][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:07:57,375][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:07:57,700][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:07:58,026][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:07:58,352][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:07:58,678][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:07:59,010][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:07:59,336][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:07:59,664][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:07:59,990][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:08:00,317][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:08:00,642][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:08:00,970][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:08:01,299][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:08:01,625][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:08:01,955][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:08:02,285][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:08:02,611][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:08:02,941][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:08:03,267][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:08:03,601][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:08:03,929][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:08:04,264][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:08:04,597][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:08:04,932][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:08:05,257][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:08:05,583][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:08:05,912][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:08:06,246][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:08:06,944][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:08:07,699][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:08:07,700][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:08:07,702][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:08:08,672][__main__][INFO] - Iteration 6 took 17s (20.43% Gen, 74.11% Train). Generation: 3s, Training: 13s. Estimated remaining time: 14h 45m 26s. Estimated total time: 14h 49m 45s. Time estimates for 10 more iterations: 2m 57s, 100 more iterations: 29m 39s, 500 more iterations: 2h 28m 17s.
[2025-11-13 08:08:08,674][__main__][INFO] - Starting iteration 6.
[2025-11-13 08:08:08,677][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1.
[2025-11-13 08:08:08,678][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:08:12,262][__main__][INFO] - Number of regex retries in iteration 6: 0
[2025-11-13 08:08:12,263][__main__][INFO] - agents played in iteration 6 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:08:12,691][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:08:12,731][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:08:12,770][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:08:12,809][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:08:12,810][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:08:12,810][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:08:13,511][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:08:13,810][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:08:14,138][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:08:14,467][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:08:14,794][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:08:15,127][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:08:15,456][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:08:15,786][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:08:16,115][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:08:16,444][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:08:16,771][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:08:17,098][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:08:17,424][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:08:17,750][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:08:18,077][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:08:18,406][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:08:18,732][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:08:19,060][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:08:19,387][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:08:19,716][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:08:20,046][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:08:20,374][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:08:20,701][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:08:21,031][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:08:21,357][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:08:21,690][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:08:22,017][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:08:22,342][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:08:22,669][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:08:22,995][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:08:23,322][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:08:23,651][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:08:23,980][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:08:24,655][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:08:25,419][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:08:25,421][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:08:25,422][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:08:26,413][__main__][INFO] - Iteration 7 took 17s (20.21% Gen, 74.19% Train). Generation: 3s, Training: 13s. Estimated remaining time: 14h 42m 16s. Estimated total time: 14h 46m 53s. Time estimates for 10 more iterations: 2m 57s, 100 more iterations: 29m 33s, 500 more iterations: 2h 27m 48s.
[2025-11-13 08:08:26,416][__main__][INFO] - Starting iteration 7.
[2025-11-13 08:08:26,419][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1.
[2025-11-13 08:08:26,420][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:08:30,014][__main__][INFO] - Number of regex retries in iteration 7: 0
[2025-11-13 08:08:30,015][__main__][INFO] - agents played in iteration 7 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:08:30,440][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:08:30,482][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:08:30,522][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:08:30,562][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:08:30,562][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:08:30,563][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:08:31,264][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:08:31,562][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:08:31,889][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:08:32,219][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:08:32,545][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:08:32,872][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:08:33,199][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:08:33,525][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:08:33,852][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:08:34,179][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:08:34,509][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:08:34,838][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:08:35,168][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:08:35,495][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:08:35,822][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:08:36,149][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:08:36,475][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:08:36,802][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:08:37,129][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:08:37,462][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:08:37,789][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:08:38,117][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:08:38,443][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:08:38,770][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:08:39,099][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:08:39,427][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:08:39,754][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:08:40,080][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:08:40,406][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:08:40,733][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:08:41,060][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:08:41,387][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:08:41,715][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:08:42,413][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:08:43,164][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:08:43,166][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:08:43,168][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:08:44,295][__main__][INFO] - Iteration 8 took 17s (20.11% Gen, 73.57% Train). Generation: 3s, Training: 13s. Estimated remaining time: 14h 48m 55s. Estimated total time: 14h 53m 50s. Time estimates for 10 more iterations: 2m 58s, 100 more iterations: 29m 47s, 500 more iterations: 2h 28m 58s.
[2025-11-13 08:08:44,297][__main__][INFO] - Starting iteration 8.
[2025-11-13 08:08:44,301][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1.
[2025-11-13 08:08:44,301][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:08:47,890][__main__][INFO] - Number of regex retries in iteration 8: 0
[2025-11-13 08:08:47,891][__main__][INFO] - agents played in iteration 8 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:08:48,321][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:08:48,361][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:08:48,401][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:08:48,440][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:08:48,441][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:08:48,441][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:08:49,144][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:08:49,442][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:08:49,771][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:08:50,096][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:08:50,423][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:08:50,749][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:08:51,076][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:08:51,403][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:08:51,730][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:08:52,059][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:08:52,385][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:08:52,712][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:08:53,040][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:08:53,368][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:08:53,694][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:08:54,021][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:08:54,349][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:08:54,683][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:08:55,012][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:08:55,339][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:08:55,666][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:08:55,993][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:08:56,326][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:08:56,656][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:08:56,986][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:08:57,316][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:08:57,649][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:08:57,976][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:08:58,306][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:08:58,633][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:08:58,960][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:08:59,286][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:08:59,616][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:09:00,297][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:09:01,050][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:09:01,052][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:09:01,054][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:09:02,080][__main__][INFO] - Iteration 9 took 17s (20.19% Gen, 74.03% Train). Generation: 3s, Training: 13s. Estimated remaining time: 14h 43m 48s. Estimated total time: 14h 49m 1s. Time estimates for 10 more iterations: 2m 57s, 100 more iterations: 29m 38s, 500 more iterations: 2h 28m 10s.
[2025-11-13 08:09:02,082][__main__][INFO] - Starting iteration 9.
[2025-11-13 08:09:02,086][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1.
[2025-11-13 08:09:02,086][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:09:05,781][__main__][INFO] - Number of regex retries in iteration 9: 0
[2025-11-13 08:09:05,782][__main__][INFO] - agents played in iteration 9 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:09:06,207][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:09:06,249][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:09:06,289][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:09:06,330][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:09:06,331][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:09:06,331][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:09:07,033][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:09:07,331][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:09:07,659][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:09:07,986][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:09:08,313][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:09:08,640][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:09:08,967][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:09:09,295][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:09:09,622][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:09:09,948][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:09:10,278][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:09:10,605][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:09:10,931][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:09:11,257][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:09:11,584][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:09:11,910][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:09:12,237][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:09:12,566][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:09:12,893][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:09:13,221][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:09:13,549][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:09:13,876][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:09:14,203][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:09:14,530][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:09:14,858][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:09:15,185][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:09:15,512][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:09:15,839][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:09:16,166][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:09:16,492][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:09:16,822][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:09:17,149][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:09:17,476][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:09:18,169][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:09:18,921][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:09:18,922][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:09:18,924][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:09:19,927][__main__][INFO] - Iteration 10 took 17s (20.71% Gen, 73.66% Train). Generation: 3s, Training: 13s. Estimated remaining time: 14h 46m 35s. Estimated total time: 14h 52m 6s. Time estimates for 10 more iterations: 2m 58s, 100 more iterations: 29m 44s, 500 more iterations: 2h 28m 41s.
[2025-11-13 08:09:19,929][__main__][INFO] - Starting iteration 10.
[2025-11-13 08:09:19,933][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1.
[2025-11-13 08:09:19,934][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:09:23,631][__main__][INFO] - Number of regex retries in iteration 10: 0
[2025-11-13 08:09:23,632][__main__][INFO] - agents played in iteration 10 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:09:24,062][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:09:24,105][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:09:24,145][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:09:24,185][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:09:24,186][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:09:24,186][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:09:24,893][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:09:25,191][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:09:25,519][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:09:25,850][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:09:26,176][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:09:26,504][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:09:26,830][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:09:27,157][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:09:27,485][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:09:27,811][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:09:28,139][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:09:28,469][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:09:28,800][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:09:29,127][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:09:29,456][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:09:29,783][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:09:30,116][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:09:30,446][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:09:30,773][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:09:31,100][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:09:31,428][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:09:31,755][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:09:32,081][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:09:32,407][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:09:32,737][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:09:33,064][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:09:33,391][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:09:33,718][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:09:34,045][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:09:34,372][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:09:34,698][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:09:35,026][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:09:35,355][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:09:36,224][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:09:36,987][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:09:36,988][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:09:36,990][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:09:38,954][__main__][INFO] - Iteration 11 took 19s (19.44% Gen, 70.22% Train). Generation: 3s, Training: 13s. Estimated remaining time: 15h 45m 16s. Estimated total time: 15h 51m 6s. Time estimates for 10 more iterations: 3m 10s, 100 more iterations: 31m 42s, 500 more iterations: 2h 38m 31s.
[2025-11-13 08:09:38,957][__main__][INFO] - Starting iteration 11.
[2025-11-13 08:09:38,960][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1.
[2025-11-13 08:09:38,961][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:09:43,197][__main__][INFO] - Number of regex retries in iteration 11: 0
[2025-11-13 08:09:43,198][__main__][INFO] - agents played in iteration 11 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:09:43,628][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:09:43,670][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:09:43,709][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:09:43,749][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:09:43,750][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:09:43,750][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:09:44,458][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:09:44,755][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:09:45,084][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:09:45,410][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:09:45,737][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:09:46,064][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:09:46,392][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:09:46,719][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:09:47,047][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:09:47,374][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:09:47,701][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:09:48,028][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:09:48,355][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:09:48,682][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:09:49,009][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:09:49,336][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:09:49,664][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:09:49,991][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:09:50,319][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:09:50,646][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:09:50,972][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:09:51,299][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:09:51,626][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:09:51,952][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:09:52,280][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:09:52,607][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:09:52,933][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:09:53,262][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:09:53,591][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:09:53,918][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:09:54,245][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:09:54,572][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:09:54,902][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:09:55,600][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:09:56,344][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:09:56,345][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:09:56,347][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:09:57,406][__main__][INFO] - Iteration 12 took 18s (22.97% Gen, 71.28% Train). Generation: 4s, Training: 13s. Estimated remaining time: 15h 16m 12s. Estimated total time: 15h 22m 20s. Time estimates for 10 more iterations: 3m 4s, 100 more iterations: 30m 44s, 500 more iterations: 2h 33m 43s.
[2025-11-13 08:09:57,408][__main__][INFO] - Starting iteration 12.
[2025-11-13 08:09:57,411][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1.
[2025-11-13 08:09:57,412][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:10:01,383][__main__][INFO] - Number of regex retries in iteration 12: 0
[2025-11-13 08:10:01,384][__main__][INFO] - agents played in iteration 12 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:10:01,816][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:10:01,858][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:10:01,897][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:10:01,936][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:10:01,937][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:10:01,937][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:10:02,644][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:10:02,944][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:10:03,273][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:10:03,606][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:10:03,940][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:10:04,270][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:10:04,600][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:10:04,935][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:10:05,262][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:10:05,590][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:10:05,917][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:10:06,245][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:10:06,574][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:10:06,907][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:10:07,237][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:10:07,564][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:10:07,891][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:10:08,218][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:10:08,545][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:10:08,880][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:10:09,214][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:10:09,541][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:10:09,870][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:10:10,200][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:10:10,526][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:10:10,855][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:10:11,185][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:10:11,514][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:10:11,844][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:10:12,172][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:10:12,501][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:10:12,835][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:10:13,167][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:10:13,885][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:10:14,646][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:10:14,647][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:10:14,649][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:10:15,831][__main__][INFO] - Iteration 13 took 18s (21.56% Gen, 72.01% Train). Generation: 3s, Training: 13s. Estimated remaining time: 15h 14m 36s. Estimated total time: 15h 21m 3s. Time estimates for 10 more iterations: 3m 4s, 100 more iterations: 30m 42s, 500 more iterations: 2h 33m 30s.
[2025-11-13 08:10:15,834][__main__][INFO] - Starting iteration 13.
[2025-11-13 08:10:15,836][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1.
[2025-11-13 08:10:15,837][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:10:19,715][__main__][INFO] - Number of regex retries in iteration 13: 0
[2025-11-13 08:10:19,716][__main__][INFO] - agents played in iteration 13 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:10:20,153][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:10:20,193][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:10:20,233][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:10:20,273][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:10:20,274][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:10:20,274][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:10:20,973][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:10:21,269][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:10:21,597][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:10:21,925][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:10:22,251][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:10:22,579][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:10:22,908][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:10:23,235][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:10:23,565][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:10:23,894][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:10:24,221][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:10:24,548][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:10:24,877][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:10:25,204][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:10:25,534][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:10:25,861][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:10:26,189][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:10:26,516][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:10:26,843][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:10:27,171][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:10:27,499][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:10:27,827][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:10:28,154][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:10:28,481][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:10:28,809][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:10:29,136][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:10:29,463][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:10:29,790][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:10:30,118][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:10:30,446][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:10:30,779][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:10:31,106][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:10:31,433][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:10:32,132][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:10:32,868][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:10:32,869][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:10:32,871][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:10:33,934][__main__][INFO] - Iteration 14 took 18s (21.43% Gen, 72.68% Train). Generation: 3s, Training: 13s. Estimated remaining time: 14h 58m 12s. Estimated total time: 15h 4m 56s. Time estimates for 10 more iterations: 3m 0s, 100 more iterations: 30m 9s, 500 more iterations: 2h 30m 49s.
[2025-11-13 08:10:33,937][__main__][INFO] - Starting iteration 14.
[2025-11-13 08:10:33,940][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1.
[2025-11-13 08:10:33,941][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:10:37,949][__main__][INFO] - Number of regex retries in iteration 14: 0
[2025-11-13 08:10:37,949][__main__][INFO] - agents played in iteration 14 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:10:38,378][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:10:38,418][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:10:38,458][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:10:38,499][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:10:38,499][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:10:38,500][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:10:39,211][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:10:39,510][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:10:39,840][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:10:40,170][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:10:40,500][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:10:40,830][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:10:41,157][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:10:41,486][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:10:41,814][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:10:42,141][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:10:42,471][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:10:42,800][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:10:43,127][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:10:43,456][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:10:43,784][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:10:44,112][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:10:44,439][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:10:44,766][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:10:45,096][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:10:45,425][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:10:45,752][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:10:46,080][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:10:46,409][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:10:46,735][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:10:47,064][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:10:47,394][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:10:47,723][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:10:48,052][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:10:48,382][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:10:48,709][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:10:49,039][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:10:49,370][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:10:49,701][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:10:50,396][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:10:51,308][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:10:51,310][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:10:51,312][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:10:52,578][__main__][INFO] - Iteration 15 took 18s (21.50% Gen, 71.69% Train). Generation: 4s, Training: 13s. Estimated remaining time: 15h 24m 52s. Estimated total time: 15h 31m 55s. Time estimates for 10 more iterations: 3m 6s, 100 more iterations: 31m 3s, 500 more iterations: 2h 35m 19s.
[2025-11-13 08:10:52,580][__main__][INFO] - Starting iteration 15.
[2025-11-13 08:10:52,582][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1.
[2025-11-13 08:10:52,583][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:10:56,559][__main__][INFO] - Number of regex retries in iteration 15: 0
[2025-11-13 08:10:56,559][__main__][INFO] - agents played in iteration 15 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:10:56,992][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:10:57,034][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:10:57,074][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:10:57,115][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:10:57,116][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:10:57,117][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:10:57,804][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:10:58,102][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:10:58,430][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:10:58,759][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:10:59,089][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:10:59,417][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:10:59,750][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:11:00,081][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:11:00,411][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:11:00,740][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:11:01,066][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:11:01,396][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:11:01,724][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:11:02,051][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:11:02,385][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:11:02,715][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:11:03,049][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:11:03,382][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:11:03,717][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:11:04,052][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:11:04,382][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:11:04,713][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:11:05,042][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:11:05,376][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:11:05,707][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:11:06,036][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:11:06,370][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:11:06,701][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:11:07,031][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:11:07,367][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:11:07,699][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:11:08,030][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:11:08,361][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:11:09,058][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:11:09,804][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:11:09,806][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:11:09,807][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:11:10,741][__main__][INFO] - Iteration 16 took 18s (21.90% Gen, 72.96% Train). Generation: 3s, Training: 13s. Estimated remaining time: 15h 0m 36s. Estimated total time: 15h 7m 58s. Time estimates for 10 more iterations: 3m 1s, 100 more iterations: 30m 15s, 500 more iterations: 2h 31m 19s.
[2025-11-13 08:11:10,743][__main__][INFO] - Starting iteration 16.
[2025-11-13 08:11:10,746][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1.
[2025-11-13 08:11:10,746][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:11:14,685][__main__][INFO] - Number of regex retries in iteration 16: 0 [2025-11-13 08:11:14,686][__main__][INFO] - agents played in iteration 16 are Alice_buffer, Alice, Bob, Bob_buffer [2025-11-13 08:11:15,119][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:11:15,161][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:11:15,202][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:11:15,243][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:11:15,243][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:11:15,244][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:11:15,913][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:11:16,211][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:11:16,539][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:11:16,869][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:11:17,196][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:11:17,523][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:11:17,850][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:11:18,177][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:11:18,504][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:11:18,832][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:11:19,160][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:11:19,489][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:11:19,819][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:11:20,152][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:11:20,479][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:11:20,806][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:11:21,133][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:11:21,461][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:11:21,788][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:11:22,114][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:11:22,441][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:11:22,772][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:11:23,106][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:11:23,445][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:11:23,771][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:11:24,100][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:11:24,431][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:11:24,765][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:11:25,091][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:11:25,421][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:11:25,748][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:11:26,077][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:11:26,408][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:11:27,104][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:11:27,841][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:11:27,843][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:11:27,845][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:11:29,086][__main__][INFO] - Iteration 17 took 18s (21.48% Gen, 71.75% Train). Generation: 3s, Training: 13s. Estimated remaining time: 15h 9m 24s. Estimated total time: 15h 17m 3s. Time estimates for 10 more iterations: 3m 3s, 100 more iterations: 30m 34s, 500 more iterations: 2h 32m 50s.
[2025-11-13 08:11:29,088][__main__][INFO] - Starting iteration 17.
[2025-11-13 08:11:29,091][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1.
[2025-11-13 08:11:29,091][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:11:33,061][__main__][INFO] - Number of regex retries in iteration 17: 0
[2025-11-13 08:11:33,061][__main__][INFO] - agents played in iteration 17 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:11:33,492][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:11:33,533][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:11:33,574][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:11:33,614][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:11:33,615][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:11:33,616][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:11:34,280][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:11:34,578][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:11:34,908][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:11:35,237][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:11:35,568][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:11:35,894][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:11:36,221][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:11:36,547][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:11:36,875][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:11:37,205][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:11:37,531][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:11:37,859][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:11:38,188][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:11:38,516][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:11:38,843][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:11:39,170][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:11:39,497][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:11:39,823][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:11:40,153][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:11:40,480][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:11:40,808][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:11:41,136][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:11:41,468][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:11:41,795][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:11:42,123][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:11:42,450][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:11:42,776][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:11:43,106][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:11:43,433][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:11:43,760][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:11:44,089][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:11:44,420][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:11:44,748][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:11:45,438][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:11:46,169][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:11:46,171][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:11:46,172][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:11:47,183][__main__][INFO] - Iteration 18 took 18s (21.94% Gen, 72.46% Train). Generation: 3s, Training: 13s. Estimated remaining time: 14h 56m 42s. Estimated total time: 15h 4m 40s. Time estimates for 10 more iterations: 3m 0s, 100 more iterations: 30m 9s, 500 more iterations: 2h 30m 46s.
[2025-11-13 08:11:47,190][__main__][INFO] - Starting iteration 18.
[2025-11-13 08:11:47,193][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1.
[2025-11-13 08:11:47,194][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:11:51,165][__main__][INFO] - Number of regex retries in iteration 18: 0
[2025-11-13 08:11:51,165][__main__][INFO] - agents played in iteration 18 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:11:51,603][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:11:51,644][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:11:51,685][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:11:51,726][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:11:51,727][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:11:51,727][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:11:52,416][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:11:52,713][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:11:53,040][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:11:53,370][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:11:53,697][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:11:54,024][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:11:54,352][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:11:54,678][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:11:55,005][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:11:55,331][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:11:55,665][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:11:55,994][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:11:56,322][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:11:56,649][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:11:56,974][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:11:57,303][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:11:57,631][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:11:57,958][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:11:58,285][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:11:58,613][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:11:58,938][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:11:59,271][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:11:59,598][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:11:59,925][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:12:00,253][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:12:00,581][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:12:00,907][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:12:01,234][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:12:01,561][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:12:01,891][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:12:02,221][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:12:02,551][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:12:02,882][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:12:03,588][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:12:04,326][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:12:04,327][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:12:04,329][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:12:05,365][__main__][INFO] - Iteration 19 took 18s (21.85% Gen, 72.44% Train). Generation: 3s, Training: 13s. Estimated remaining time: 15h 0m 22s. Estimated total time: 15h 8m 39s. Time estimates for 10 more iterations: 3m 1s, 100 more iterations: 30m 17s, 500 more iterations: 2h 31m 26s.
[2025-11-13 08:12:05,367][__main__][INFO] - Starting iteration 19.
[2025-11-13 08:12:05,370][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1.
[2025-11-13 08:12:05,371][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:12:09,372][__main__][INFO] - Number of regex retries in iteration 19: 0
[2025-11-13 08:12:09,372][__main__][INFO] - agents played in iteration 19 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:12:09,806][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:12:09,848][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:12:09,889][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:12:09,930][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:12:09,931][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:12:09,931][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:12:10,606][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:12:10,902][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:12:11,230][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:12:11,556][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:12:11,883][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:12:12,212][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:12:12,538][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:12:12,872][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:12:13,202][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:12:13,537][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:12:13,865][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:12:14,193][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:12:14,522][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:12:14,852][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:12:15,180][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:12:15,507][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:12:15,834][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:12:16,166][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:12:16,496][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:12:16,825][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:12:17,156][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:12:17,482][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:12:17,811][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:12:18,139][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:12:18,465][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:12:18,792][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:12:19,122][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:12:19,449][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:12:19,776][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:12:20,103][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:12:20,429][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:12:20,756][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:12:21,086][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:12:21,814][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:12:22,545][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:12:22,547][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:12:22,549][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:12:23,532][__main__][INFO] - Iteration 20 took 18s (22.03% Gen, 72.55% Train). Generation: 4s, Training: 13s. Estimated remaining time: 14h 59m 33s. Estimated total time: 15h 8m 8s. Time estimates for 10 more iterations: 3m 1s, 100 more iterations: 30m 16s, 500 more iterations: 2h 31m 21s.
[2025-11-13 08:12:23,534][__main__][INFO] - Starting iteration 20.
[2025-11-13 08:12:23,536][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1.
[2025-11-13 08:12:23,537][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:12:27,542][__main__][INFO] - Number of regex retries in iteration 20: 0
[2025-11-13 08:12:27,543][__main__][INFO] - agents played in iteration 20 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:12:27,980][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:12:28,021][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:12:28,063][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:12:28,104][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:12:28,105][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:12:28,105][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:12:28,778][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:12:29,212][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:12:29,545][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:12:29,873][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:12:30,200][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:12:30,530][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:12:30,865][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:12:31,195][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:12:31,526][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:12:31,861][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:12:32,190][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:12:32,521][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:12:32,849][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:12:33,176][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:12:33,503][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:12:33,835][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:12:34,166][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:12:34,496][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:12:34,823][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:12:35,149][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:12:35,479][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:12:35,807][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:12:36,134][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:12:36,462][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:12:36,792][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:12:37,122][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:12:37,450][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:12:37,779][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:12:38,109][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:12:38,440][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:12:38,768][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:12:39,096][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:12:39,428][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:12:40,136][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:12:40,898][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:12:40,899][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:12:40,910][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:12:43,085][__main__][INFO] - Iteration 21 took 19s (20.49% Gen, 68.37% Train). Generation: 4s, Training: 13s. Estimated remaining time: 16h 8m 35s. Estimated total time: 16h 17m 29s. Time estimates for 10 more iterations: 3m 15s, 100 more iterations: 32m 34s, 500 more iterations: 2h 42m 54s.
[2025-11-13 08:12:43,087][__main__][INFO] - Starting iteration 21.
[2025-11-13 08:12:43,090][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1.
[2025-11-13 08:12:43,091][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:12:47,230][__main__][INFO] - Number of regex retries in iteration 21: 0
[2025-11-13 08:12:47,231][__main__][INFO] - agents played in iteration 21 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:12:47,671][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:12:47,711][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:12:47,751][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:12:47,791][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:12:47,791][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:12:47,792][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:12:48,456][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:12:48,754][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:12:49,082][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:12:49,412][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:12:49,739][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:12:50,072][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:12:50,401][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:12:50,732][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:12:51,059][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:12:51,391][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:12:51,727][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:12:52,057][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:12:52,390][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:12:52,719][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:12:53,054][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:12:53,384][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:12:53,711][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:12:54,038][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:12:54,364][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:12:54,691][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:12:55,020][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:12:55,346][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:12:55,673][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:12:56,001][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:12:56,328][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:12:56,655][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:12:56,985][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:12:57,319][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:12:57,649][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:12:57,979][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:12:58,309][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:12:58,639][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:12:58,970][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:12:59,660][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:13:00,409][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:13:00,410][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:13:00,413][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:13:01,321][__main__][INFO] - Iteration 22 took 18s (22.71% Gen, 72.31% Train). Generation: 4s, Training: 13s. Estimated remaining time: 15h 2m 22s. Estimated total time: 15h 11m 34s. Time estimates for 10 more iterations: 3m 2s, 100 more iterations: 30m 23s, 500 more iterations: 2h 31m 55s.
[2025-11-13 08:13:01,323][__main__][INFO] - Starting iteration 22.
[2025-11-13 08:13:01,326][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1.
[2025-11-13 08:13:01,326][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:13:05,210][__main__][INFO] - Number of regex retries in iteration 22: 0
[2025-11-13 08:13:05,210][__main__][INFO] - agents played in iteration 22 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:13:05,648][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:13:05,690][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:13:05,731][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:13:05,772][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:13:05,773][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:13:05,773][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:13:06,434][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:13:06,733][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:13:07,062][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:13:07,390][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:13:07,717][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:13:08,046][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:13:08,379][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:13:08,707][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:13:09,034][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:13:09,364][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:13:09,691][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:13:10,018][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:13:10,345][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:13:10,675][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:13:11,001][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:13:11,329][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:13:11,657][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:13:11,983][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:13:12,311][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:13:12,638][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:13:12,968][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:13:13,295][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:13:13,622][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:13:13,950][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:13:14,277][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:13:14,605][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:13:14,934][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:13:15,264][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:13:15,594][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:13:15,925][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:13:16,253][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:13:16,584][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:13:16,913][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:13:17,616][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:13:18,352][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:13:18,353][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:13:18,355][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:13:19,502][__main__][INFO] - Iteration 23 took 18s (21.37% Gen, 72.31% Train). Generation: 3s, Training: 13s. Estimated remaining time: 14h 59m 21s. Estimated total time: 15h 8m 51s. Time estimates for 10 more iterations: 3m 1s, 100 more iterations: 30m 17s, 500 more iterations: 2h 31m 28s.
[2025-11-13 08:13:19,504][__main__][INFO] - Starting iteration 23.
[2025-11-13 08:13:19,507][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1.
[2025-11-13 08:13:19,508][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:13:23,445][__main__][INFO] - Number of regex retries in iteration 23: 0
[2025-11-13 08:13:23,446][__main__][INFO] - agents played in iteration 23 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:13:23,897][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:13:23,940][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:13:23,981][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:13:24,021][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:13:24,022][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:13:24,022][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:13:24,695][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:13:24,992][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:13:25,320][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:13:25,650][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:13:25,979][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:13:26,306][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:13:26,636][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:13:26,964][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:13:27,292][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:13:27,618][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:13:27,948][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:13:28,278][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:13:28,607][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:13:28,935][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:13:29,264][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:13:29,599][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:13:29,930][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:13:30,260][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:13:30,591][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:13:30,919][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:13:31,246][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:13:31,580][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:13:31,916][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:13:32,250][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:13:32,581][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:13:32,908][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:13:33,238][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:13:33,567][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:13:33,898][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:13:34,228][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:13:34,558][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:13:34,890][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:13:35,222][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:13:35,925][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:13:36,661][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:13:36,662][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:13:36,666][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:13:37,570][__main__][INFO] - Iteration 24 took 18s (21.80% Gen, 73.19% Train). Generation: 3s, Training: 13s. Estimated remaining time: 14h 53m 24s. Estimated total time: 15h 3m 12s. Time estimates for 10 more iterations: 3m 0s, 100 more iterations: 30m 6s, 500 more iterations: 2h 30m 32s.
[2025-11-13 08:13:37,573][__main__][INFO] - Starting iteration 24.
[2025-11-13 08:13:37,576][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1.
[2025-11-13 08:13:37,576][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:13:41,463][__main__][INFO] - Number of regex retries in iteration 24: 0
[2025-11-13 08:13:41,464][__main__][INFO] - agents played in iteration 24 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:13:41,897][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:13:41,938][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:13:41,979][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:13:42,020][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:13:42,021][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:13:42,021][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:13:42,706][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:13:43,003][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:13:43,331][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:13:43,658][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:13:43,986][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:13:44,313][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:13:44,640][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:13:44,970][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:13:45,299][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:13:45,630][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:13:45,959][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:13:46,289][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:13:46,622][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:13:46,952][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:13:47,278][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:13:47,605][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:13:47,938][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:13:48,268][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:13:48,597][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:13:48,927][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:13:49,254][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:13:49,587][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:13:49,914][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:13:50,244][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:13:50,573][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:13:50,899][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:13:51,226][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:13:51,554][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:13:51,881][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:13:52,214][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:13:52,544][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:13:52,872][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:13:53,201][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:13:53,903][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:13:54,639][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:13:54,640][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:13:54,642][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:13:55,830][__main__][INFO] - Iteration 25 took 18s (21.29% Gen, 72.19% Train). Generation: 3s, Training: 13s. Estimated remaining time: 15h 2m 40s. Estimated total time: 15h 12m 47s. Time estimates for 10 more iterations: 3m 2s, 100 more iterations: 30m 25s, 500 more iterations: 2h 32m 7s.
[2025-11-13 08:13:55,832][__main__][INFO] - Starting iteration 25.
[2025-11-13 08:13:55,835][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1.
[2025-11-13 08:13:55,836][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:13:59,757][__main__][INFO] - Number of regex retries in iteration 25: 0
[2025-11-13 08:13:59,758][__main__][INFO] - agents played in iteration 25 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:14:00,196][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:14:00,237][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:14:00,278][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:14:00,319][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:14:00,319][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:14:00,320][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:14:00,993][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:14:01,290][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:14:01,618][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:14:01,945][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:14:02,274][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:14:02,601][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:14:02,927][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:14:03,254][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:14:03,580][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:14:03,906][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:14:04,233][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:14:04,559][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:14:04,885][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:14:05,215][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:14:05,544][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:14:05,870][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:14:06,200][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:14:06,526][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:14:06,853][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:14:07,184][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:14:07,510][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:14:07,837][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:14:08,168][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:14:08,495][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:14:08,824][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:14:09,156][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:14:09,491][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:14:09,824][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:14:10,154][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:14:10,484][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:14:10,814][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:14:11,145][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:14:11,476][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:14:12,197][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:14:12,956][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:14:12,958][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:14:12,960][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:14:13,961][__main__][INFO] - Iteration 26 took 18s (21.64% Gen, 72.83% Train). Generation: 3s, Training: 13s. Estimated remaining time: 14h 55m 54s. Estimated total time: 15h 6m 19s. Time estimates for 10 more iterations: 3m 1s, 100 more iterations: 30m 12s, 500 more iterations: 2h 31m 3s.
[2025-11-13 08:14:13,968][__main__][INFO] - Starting iteration 26.
[2025-11-13 08:14:13,971][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1.
[2025-11-13 08:14:13,971][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:14:17,899][__main__][INFO] - Number of regex retries in iteration 26: 0
[2025-11-13 08:14:17,900][__main__][INFO] - agents played in iteration 26 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:14:18,330][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:14:18,372][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:14:18,413][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:14:18,454][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:14:18,454][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:14:18,455][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:14:19,148][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:14:19,445][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:14:19,775][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:14:20,104][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:14:20,431][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:14:20,764][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:14:21,096][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:14:21,426][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:14:21,754][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:14:22,083][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:14:22,411][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:14:22,737][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:14:23,065][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:14:23,395][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:14:23,726][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:14:24,054][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:14:24,384][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:14:24,711][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:14:25,037][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:14:25,368][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:14:25,695][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:14:26,022][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:14:26,350][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:14:26,677][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:14:27,007][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:14:27,333][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:14:27,662][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:14:27,990][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:14:28,317][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:14:28,644][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:14:28,970][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:14:29,298][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:14:29,628][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:14:30,329][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:14:31,059][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:14:31,061][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:14:31,062][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:14:32,029][__main__][INFO] - Iteration 27 took 18s (21.75% Gen, 72.89% Train). Generation: 3s, Training: 13s. Estimated remaining time: 14h 52m 15s. Estimated total time: 15h 2m 57s. Time estimates for 10 more iterations: 3m 0s, 100 more iterations: 30m 5s, 500 more iterations: 2h 30m 29s.
[2025-11-13 08:14:32,031][__main__][INFO] - Starting iteration 27.
[2025-11-13 08:14:32,034][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1.
[2025-11-13 08:14:32,034][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:14:36,022][__main__][INFO] - Number of regex retries in iteration 27: 0 [2025-11-13 08:14:36,023][__main__][INFO] - agents played in iteration 27 are Alice_buffer, Alice, Bob, Bob_buffer [2025-11-13 08:14:36,450][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:14:36,492][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:14:36,534][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:14:36,575][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:14:36,576][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:14:36,576][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:14:37,279][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:14:37,576][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:14:37,910][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:14:38,236][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:14:38,565][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:14:38,892][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:14:39,219][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:14:39,549][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:14:39,879][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:14:40,212][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:14:40,543][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:14:40,872][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:14:41,198][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:14:41,525][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:14:41,851][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:14:42,179][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:14:42,505][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:14:42,835][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:14:43,161][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:14:43,490][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:14:43,820][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:14:44,146][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:14:44,473][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:14:44,802][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:14:45,131][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:14:45,461][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:14:45,790][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:14:46,122][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:14:46,449][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:14:46,778][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:14:47,107][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:14:47,435][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:14:47,762][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:14:48,466][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:14:49,210][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:14:49,212][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:14:49,214][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:14:50,261][__main__][INFO] - Iteration 28 took 18s (21.88% Gen, 72.36% Train). Generation: 3s, Training: 13s. Estimated remaining time: 15h 0m 24s. Estimated total time: 15h 11m 25s. Time estimates for 10 more iterations: 3m 2s, 100 more iterations: 30m 22s, 500 more iterations: 2h 31m 54s.
[2025-11-13 08:14:50,264][__main__][INFO] - Starting iteration 28.
[2025-11-13 08:14:50,267][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1.
[2025-11-13 08:14:50,267][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:14:54,303][__main__][INFO] - Number of regex retries in iteration 28: 0
[2025-11-13 08:14:54,303][__main__][INFO] - agents played in iteration 28 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:14:54,733][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:14:54,775][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:14:54,817][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:14:54,858][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:14:54,859][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:14:54,859][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:14:55,564][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:14:55,862][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:14:56,191][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:14:56,524][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:14:56,858][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:14:57,190][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:14:57,523][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:14:57,858][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:14:58,186][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:14:58,511][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:14:58,839][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:14:59,168][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:14:59,498][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:14:59,831][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:15:00,165][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:15:00,496][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:15:00,824][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:15:01,151][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:15:01,480][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:15:01,809][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:15:02,142][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:15:02,471][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:15:02,802][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:15:03,131][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:15:03,461][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:15:03,791][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:15:04,120][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:15:04,451][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:15:04,782][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:15:05,111][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:15:05,442][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:15:05,771][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:15:06,100][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:15:06,804][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:15:07,547][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:15:07,548][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:15:07,550][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:15:08,491][__main__][INFO] - Iteration 29 took 18s (22.15% Gen, 72.68% Train). Generation: 4s, Training: 13s. Estimated remaining time: 14h 59m 56s. Estimated total time: 15h 11m 15s. Time estimates for 10 more iterations: 3m 2s, 100 more iterations: 30m 22s, 500 more iterations: 2h 31m 52s.
[2025-11-13 08:15:08,493][__main__][INFO] - Starting iteration 29.
[2025-11-13 08:15:08,496][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1.
[2025-11-13 08:15:08,496][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:15:12,543][__main__][INFO] - Number of regex retries in iteration 29: 0
[2025-11-13 08:15:12,543][__main__][INFO] - agents played in iteration 29 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:15:12,990][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:15:13,031][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:15:13,072][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:15:13,114][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:15:13,114][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:15:13,115][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:15:13,831][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:15:14,129][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:15:14,456][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:15:14,784][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:15:15,113][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:15:15,444][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:15:15,773][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:15:16,102][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:15:16,429][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:15:16,763][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:15:17,094][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:15:17,421][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:15:17,750][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:15:18,079][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:15:18,409][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:15:18,737][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:15:19,065][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:15:19,395][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:15:19,722][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:15:20,049][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:15:20,376][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:15:20,706][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:15:21,037][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:15:21,371][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:15:21,701][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:15:22,030][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:15:22,358][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:15:22,687][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:15:23,018][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:15:23,348][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:15:23,676][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:15:24,002][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:15:24,332][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:15:25,060][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:15:25,824][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:15:25,827][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:15:25,828][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:15:26,806][__main__][INFO] - Iteration 30 took 18s (22.10% Gen, 72.55% Train). Generation: 4s, Training: 13s. Estimated remaining time: 15h 3m 55s. Estimated total time: 15h 15m 32s. Time estimates for 10 more iterations: 3m 3s, 100 more iterations: 30m 31s, 500 more iterations: 2h 32m 35s.
[2025-11-13 08:15:26,808][__main__][INFO] - Starting iteration 30.
[2025-11-13 08:15:26,810][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1.
[2025-11-13 08:15:26,811][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:15:30,851][__main__][INFO] - Number of regex retries in iteration 30: 0
[2025-11-13 08:15:30,852][__main__][INFO] - agents played in iteration 30 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:15:31,281][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:15:31,323][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:15:31,364][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:15:31,406][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:15:31,406][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:15:31,406][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:15:32,115][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:15:32,412][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:15:32,748][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:15:33,081][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:15:33,410][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:15:33,745][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:15:34,081][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:15:34,411][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:15:34,741][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:15:35,071][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:15:35,397][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:15:35,730][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:15:36,061][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:15:36,388][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:15:36,722][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:15:37,049][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:15:37,378][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:15:37,708][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:15:38,038][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:15:38,364][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:15:38,691][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:15:39,019][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:15:39,347][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:15:39,674][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:15:40,003][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:15:40,330][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:15:40,660][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:15:40,986][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:15:41,316][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:15:41,644][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:15:41,973][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:15:42,300][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:15:42,629][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:15:43,331][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:15:44,077][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:15:44,079][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:15:44,080][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:15:46,052][__main__][INFO] - Iteration 31 took 19s (21.00% Gen, 68.75% Train). Generation: 4s, Training: 13s. Estimated remaining time: 15h 50m 9s. Estimated total time: 16h 2m 6s. Time estimates for 10 more iterations: 3m 12s, 100 more iterations: 32m 4s, 500 more iterations: 2h 40m 21s.
[2025-11-13 08:15:46,054][__main__][INFO] - Starting iteration 31.
[2025-11-13 08:15:46,057][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1.
[2025-11-13 08:15:46,057][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:15:50,688][__main__][INFO] - Number of regex retries in iteration 31: 0
[2025-11-13 08:15:50,689][__main__][INFO] - agents played in iteration 31 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:15:51,126][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:15:51,167][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:15:51,208][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:15:51,248][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:15:51,249][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:15:51,249][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:15:51,960][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:15:52,257][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:15:52,584][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:15:52,918][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:15:53,247][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:15:53,577][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:15:53,908][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:15:54,241][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:15:54,572][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:15:54,905][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:15:55,236][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:15:55,565][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:15:55,894][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:15:56,221][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:15:56,548][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:15:56,876][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:15:57,202][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:15:57,529][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:15:57,856][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:15:58,183][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:15:58,510][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:15:58,837][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:15:59,164][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:15:59,491][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:15:59,818][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:16:00,146][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:16:00,475][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:16:00,801][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:16:01,131][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:16:01,459][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:16:01,792][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:16:02,121][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:16:02,451][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:16:03,143][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:16:03,881][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:16:03,883][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:16:03,884][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:16:04,875][__main__][INFO] - Iteration 32 took 18s (24.61% Gen, 70.12% Train). Generation: 4s, Training: 13s. Estimated remaining time: 15h 28m 41s. Estimated total time: 15h 40m 57s. Time estimates for 10 more iterations: 3m 8s, 100 more iterations: 31m 21s, 500 more iterations: 2h 36m 49s.
[2025-11-13 08:16:04,877][__main__][INFO] - Starting iteration 32.
[2025-11-13 08:16:04,880][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1.
[2025-11-13 08:16:04,881][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:16:09,420][__main__][INFO] - Number of regex retries in iteration 32: 0
[2025-11-13 08:16:09,421][__main__][INFO] - agents played in iteration 32 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:16:09,901][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:16:09,943][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:16:09,983][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:16:10,024][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:16:10,025][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:16:10,025][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:16:10,735][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:16:11,033][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:16:11,360][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:16:11,686][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:16:12,015][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:16:12,341][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:16:12,672][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:16:13,003][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:16:13,331][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:16:13,657][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:16:13,986][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:16:14,312][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:16:14,642][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:16:14,977][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:16:15,307][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:16:15,634][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:16:15,960][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:16:16,287][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:16:16,615][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:16:16,948][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:16:17,282][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:16:17,613][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:16:17,938][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:16:18,264][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:16:18,591][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:16:18,918][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:16:19,249][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:16:19,577][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:16:19,903][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:16:20,229][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:16:20,557][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:16:20,884][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:16:21,212][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
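The mini-batch loop above logs every 4th batch of 128 and then reports one accumulated total before a single optimizer step. A pure-Python schematic of that accumulate-then-step pattern follows; the function name `run_accumulation` and the token-weighted sum are assumptions for illustration (the real trainer accumulates gradients, not scalar losses).

```python
def run_accumulation(minibatches, log_every=4):
    """Accumulate a scalar 'loss' over all mini-batches, stepping once at the end.

    Mirrors the log's shape: 'Processing mini-batch i of N' every
    `log_every` batches, then one accumulated total. Each element of
    `minibatches` is a (loss, token_count) pair (assumed layout).
    """
    total_loss = 0.0
    total_tokens = 0
    n = len(minibatches)
    for i, (loss, tokens) in enumerate(minibatches):
        if i % log_every == 0:
            print(f"Processing mini-batch {i} of {n}")
        total_loss += loss * tokens   # token-weighted accumulation (assumption)
        total_tokens += tokens
    print(f"Accumulated the policy gradient loss for {total_tokens} tokens.")
    return total_loss / total_tokens


# 128 mini-batches of 30 tokens each gives the 3840 tokens seen in the log.
avg_loss = run_accumulation([(0.5, 30)] * 128)
```

The ~0.33 s spacing between consecutive log timestamps is the per-group forward/backward cost; accumulating lets the effective batch (bs128 in the run name) exceed what fits in VRAM at once.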
[2025-11-13 08:16:21,915][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:16:22,652][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:16:22,654][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:16:22,656][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:16:23,630][__main__][INFO] - Iteration 33 took 18s (24.21% Gen, 70.58% Train). Generation: 4s, Training: 13s. Estimated remaining time: 15h 24m 58s. Estimated total time: 15h 37m 32s. Time estimates for 10 more iterations: 3m 7s, 100 more iterations: 31m 15s, 500 more iterations: 2h 36m 15s.
[2025-11-13 08:16:23,633][__main__][INFO] - Starting iteration 33.
[2025-11-13 08:16:23,636][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1.
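The per-iteration summary ("Iteration 33 took 18s … Estimated remaining time …") can be derived from a running average of iteration durations. A hedged sketch of that arithmetic follows; using the plain mean over all completed iterations is an assumption, and the real estimator may smooth or weight recent iterations differently.

```python
def fmt_duration(seconds: float) -> str:
    """Render seconds in the log's 'Xh Ym Zs' style, dropping leading zero units."""
    s = int(seconds)
    h, rem = divmod(s, 3600)
    m, sec = divmod(rem, 60)
    parts = []
    if h:
        parts.append(f"{h}h")
    if m or h:
        parts.append(f"{m}m")
    parts.append(f"{sec}s")
    return " ".join(parts)


def eta(iteration_times, total_iterations):
    """Estimate remaining wall time from the mean duration so far (assumed scheme)."""
    done = len(iteration_times)
    avg = sum(iteration_times) / done
    return fmt_duration(avg * (total_iterations - done))
```

Sanity check against the log: at ~18 s per iteration, the quoted "500 more iterations: 2h 36m 15s" is 9375 s, i.e. 18.75 s/iteration, consistent with an average kept at sub-second precision even though the headline duration is rounded to 18s.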
[2025-11-13 08:16:23,637][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:16:28,130][__main__][INFO] - Number of regex retries in iteration 33: 0
[2025-11-13 08:16:28,131][__main__][INFO] - agents played in iteration 33 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:16:28,566][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:16:28,607][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:16:28,648][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:16:28,689][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:16:28,690][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:16:28,690][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:16:29,418][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:16:29,718][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:16:30,046][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:16:30,376][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:16:30,705][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:16:31,040][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:16:31,370][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:16:31,696][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:16:32,025][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:16:32,355][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:16:32,682][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:16:33,013][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:16:33,343][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:16:33,669][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:16:33,996][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:16:34,323][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:16:34,649][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:16:34,979][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:16:35,306][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:16:35,634][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:16:35,963][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:16:36,291][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:16:36,620][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:16:36,945][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:16:37,272][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:16:37,598][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:16:37,928][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:16:38,258][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:16:38,585][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:16:38,912][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:16:39,242][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:16:39,569][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:16:39,897][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:16:40,595][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:16:41,343][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:16:41,344][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:16:41,346][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:16:42,345][__main__][INFO] - Iteration 34 took 18s (24.02% Gen, 70.63% Train). Generation: 4s, Training: 13s. Estimated remaining time: 15h 22m 35s. Estimated total time: 15h 35m 28s. Time estimates for 10 more iterations: 3m 7s, 100 more iterations: 31m 10s, 500 more iterations: 2h 35m 54s.
[2025-11-13 08:16:42,347][__main__][INFO] - Starting iteration 34.
[2025-11-13 08:16:42,350][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1.
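Each step ends by persisting the policy optimizer, critic optimizer, and trainer annealing state to the run directory, so the run can resume after preemption. A generic sketch of crash-safe state saving follows; the real trainer presumably uses `torch.save` for the `.pt` files, and the atomic temp-file-then-rename scheme and the `save_state` name here are assumptions, not the `mllm` implementation.

```python
import os
import pickle
import tempfile


def save_state(state: dict, path: str) -> None:
    """Write `state` atomically: serialize to a temp file, then rename.

    A crash mid-write then never leaves a truncated checkpoint at `path`,
    because os.replace is atomic on POSIX filesystems.
    """
    directory = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, path)
    print(f"Saved trainer state to {path}")
```

The sub-millisecond gaps between the three "Saved …" records suggest the saved states are small (adapter-sized, matching the LoRA-style `agent_adapter`/`critic_adapter` initialization at the top of the log).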
[2025-11-13 08:16:42,351][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:16:46,881][__main__][INFO] - Number of regex retries in iteration 34: 0
[2025-11-13 08:16:46,882][__main__][INFO] - agents played in iteration 34 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:16:47,345][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:16:47,386][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:16:47,427][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:16:47,468][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:16:47,469][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:16:47,469][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:16:48,188][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:16:48,487][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:16:48,815][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:16:49,143][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:16:49,471][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:16:49,798][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:16:50,125][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:16:50,455][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:16:50,786][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:16:51,116][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:16:51,446][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:16:51,772][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:16:52,100][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:16:52,427][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:16:52,754][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:16:53,080][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:16:53,407][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:16:53,734][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:16:54,062][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:16:54,392][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:16:54,726][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:16:55,061][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:16:55,395][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:16:55,724][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:16:56,054][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:16:56,388][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:16:56,723][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:16:57,059][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:16:57,385][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:16:57,715][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:16:58,044][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:16:58,374][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:16:58,710][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:16:59,419][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:17:00,168][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:17:00,169][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:17:00,171][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:17:01,156][__main__][INFO] - Iteration 35 took 18s (24.09% Gen, 70.66% Train). Generation: 4s, Training: 13s. Estimated remaining time: 15h 27m 7s. Estimated total time: 15h 40m 19s. Time estimates for 10 more iterations: 3m 8s, 100 more iterations: 31m 20s, 500 more iterations: 2h 36m 43s.
[2025-11-13 08:17:01,159][__main__][INFO] - Starting iteration 35.
[2025-11-13 08:17:01,163][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1.
[2025-11-13 08:17:01,163][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:17:05,680][__main__][INFO] - Number of regex retries in iteration 35: 0
[2025-11-13 08:17:05,681][__main__][INFO] - agents played in iteration 35 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:17:06,127][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:17:06,167][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:17:06,208][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:17:06,250][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:17:06,251][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:17:06,251][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:17:06,981][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:17:07,280][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:17:07,609][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:17:07,939][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:17:08,271][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:17:08,602][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:17:08,930][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:17:09,256][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:17:09,583][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:17:09,913][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:17:10,240][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:17:10,566][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:17:10,893][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:17:11,223][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:17:11,550][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:17:11,876][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:17:12,206][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:17:12,537][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:17:12,863][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:17:13,192][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:17:13,521][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:17:13,851][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:17:14,180][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:17:14,509][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:17:14,844][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:17:15,174][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:17:15,501][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:17:15,828][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:17:16,155][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:17:16,485][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:17:16,815][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:17:17,142][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:17:17,476][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:17:18,179][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:17:19,055][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:17:19,056][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:17:19,068][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:17:20,305][__main__][INFO] - Iteration 36 took 19s (23.60% Gen, 69.93% Train). Generation: 4s, Training: 13s. Estimated remaining time: 15h 43m 40s. Estimated total time: 15h 57m 11s. Time estimates for 10 more iterations: 3m 11s, 100 more iterations: 31m 54s, 500 more iterations: 2h 39m 31s.
[2025-11-13 08:17:20,307][__main__][INFO] - Starting iteration 36.
[2025-11-13 08:17:20,310][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1.
[2025-11-13 08:17:20,310][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:17:24,649][__main__][INFO] - Number of regex retries in iteration 36: 0
[2025-11-13 08:17:24,649][__main__][INFO] - agents played in iteration 36 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:17:25,080][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:17:25,121][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:17:25,163][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:17:25,204][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:17:25,205][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:17:25,205][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:17:25,929][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:17:26,229][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:17:26,563][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:17:26,891][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:17:27,220][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:17:27,548][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:17:27,877][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:17:28,204][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:17:28,535][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:17:28,865][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:17:29,195][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:17:29,528][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:17:29,860][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:17:30,190][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:17:30,519][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:17:30,847][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:17:31,181][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:17:31,515][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:17:31,843][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:17:32,169][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:17:32,501][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:17:32,829][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:17:33,155][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:17:33,483][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:17:33,810][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:17:34,140][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:17:34,468][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:17:34,795][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:17:35,122][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:17:35,449][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:17:35,778][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:17:36,105][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:17:36,437][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:17:37,121][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:17:37,854][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:17:37,855][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:17:37,857][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:17:38,792][__main__][INFO] - Iteration 37 took 18s (23.47% Gen, 71.46% Train). Generation: 4s, Training: 13s. Estimated remaining time: 15h 10m 20s. Estimated total time: 15h 24m 9s. Time estimates for 10 more iterations: 3m 4s, 100 more iterations: 30m 48s, 500 more iterations: 2h 34m 1s.
[2025-11-13 08:17:38,799][__main__][INFO] - Starting iteration 37.
[2025-11-13 08:17:38,801][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1.
[2025-11-13 08:17:38,802][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:17:43,126][__main__][INFO] - Number of regex retries in iteration 37: 0
[2025-11-13 08:17:43,127][__main__][INFO] - agents played in iteration 37 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:17:43,563][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:17:43,604][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:17:43,645][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:17:43,686][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:17:43,687][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:17:43,687][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:17:44,407][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:17:44,707][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:17:45,037][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:17:45,365][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:17:45,693][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:17:46,020][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:17:46,347][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:17:46,673][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:17:47,000][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:17:47,328][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:17:47,655][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:17:47,982][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:17:48,309][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:17:48,639][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:17:48,966][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:17:49,292][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:17:49,620][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:17:49,947][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:17:50,275][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:17:50,601][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:17:50,929][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:17:51,257][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:17:51,586][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:17:51,917][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:17:52,245][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:17:52,576][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:17:52,905][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:17:53,235][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:17:53,562][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:17:53,891][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:17:54,218][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:17:54,544][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:17:54,875][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:17:55,571][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:17:56,318][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:17:56,320][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:17:56,321][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:17:57,318][__main__][INFO] - Iteration 38 took 18s (23.35% Gen, 71.26% Train). Generation: 4s, Training: 13s. Estimated remaining time: 15h 11m 44s. Estimated total time: 15h 25m 52s. Time estimates for 10 more iterations: 3m 5s, 100 more iterations: 30m 51s, 500 more iterations: 2h 34m 18s.
[2025-11-13 08:17:57,321][__main__][INFO] - Starting iteration 38.
[2025-11-13 08:17:57,325][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1.
[2025-11-13 08:17:57,325][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:18:01,740][__main__][INFO] - Number of regex retries in iteration 38: 0
[2025-11-13 08:18:01,741][__main__][INFO] - agents played in iteration 38 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:18:02,174][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:18:02,217][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:18:02,259][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:18:02,299][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:18:02,300][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:18:02,300][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:18:03,043][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:18:03,343][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:18:03,675][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:18:04,007][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:18:04,342][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:18:04,669][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:18:05,003][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:18:05,337][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:18:05,671][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:18:06,002][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:18:06,332][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:18:06,664][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:18:06,995][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:18:07,324][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:18:07,654][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:18:07,984][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:18:08,320][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:18:08,648][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:18:08,981][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:18:09,310][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:18:09,640][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:18:09,971][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:18:10,301][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:18:10,631][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:18:10,961][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:18:11,294][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:18:11,624][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:18:11,953][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:18:12,281][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:18:12,608][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:18:12,935][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:18:13,266][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:18:13,594][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:18:14,279][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:18:15,019][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:18:15,021][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:18:15,023][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:18:16,177][__main__][INFO] - Iteration 39 took 18s (23.42% Gen, 70.45% Train). Generation: 4s, Training: 13s. Estimated remaining time: 15h 28m 12s. Estimated total time: 15h 42m 39s. Time estimates for 10 more iterations: 3m 8s, 100 more iterations: 31m 25s, 500 more iterations: 2h 37m 6s.
[2025-11-13 08:18:16,185][__main__][INFO] - Starting iteration 39.
[2025-11-13 08:18:16,189][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1.
[2025-11-13 08:18:16,189][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:18:20,493][__main__][INFO] - Number of regex retries in iteration 39: 0
[2025-11-13 08:18:20,493][__main__][INFO] - agents played in iteration 39 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:18:20,931][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:18:20,972][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:18:21,013][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:18:21,054][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:18:21,054][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:18:21,055][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:18:21,767][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:18:22,065][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:18:22,393][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:18:22,720][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:18:23,049][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:18:23,374][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:18:23,703][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:18:24,030][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:18:24,358][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:18:24,685][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:18:25,013][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:18:25,341][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:18:25,667][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:18:25,993][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:18:26,319][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:18:26,647][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:18:26,975][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:18:27,302][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:18:27,629][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:18:27,958][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:18:28,287][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:18:28,618][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:18:28,945][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:18:29,271][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:18:29,599][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:18:29,926][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:18:30,255][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:18:30,584][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:18:30,911][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:18:31,237][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:18:31,564][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:18:31,891][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:18:32,218][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:18:32,923][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:18:33,662][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:18:33,664][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:18:33,665][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:18:34,872][__main__][INFO] - Iteration 40 took 18s (23.03% Gen, 70.50% Train). Generation: 4s, Training: 13s. Estimated remaining time: 15h 19m 29s. Estimated total time: 15h 34m 14s. Time estimates for 10 more iterations: 3m 6s, 100 more iterations: 31m 8s, 500 more iterations: 2h 35m 42s.
[2025-11-13 08:18:34,874][__main__][INFO] - Starting iteration 40.
[2025-11-13 08:18:34,877][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1.
[2025-11-13 08:18:34,877][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:18:39,239][__main__][INFO] - Number of regex retries in iteration 40: 0
[2025-11-13 08:18:39,240][__main__][INFO] - agents played in iteration 40 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:18:39,676][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:18:39,717][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:18:39,758][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:18:39,799][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:18:39,800][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:18:39,800][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:18:40,516][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:18:40,817][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:18:41,144][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:18:41,478][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:18:41,812][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:18:42,140][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:18:42,466][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:18:42,793][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:18:43,119][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:18:43,446][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:18:43,773][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:18:44,100][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:18:44,429][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:18:44,756][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:18:45,082][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:18:45,408][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:18:45,737][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:18:46,064][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:18:46,391][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:18:46,719][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:18:47,047][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:18:47,373][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:18:47,700][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:18:48,028][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:18:48,355][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:18:48,684][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:18:49,011][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:18:49,338][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:18:49,666][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:18:49,993][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:18:50,320][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:18:50,647][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:18:50,974][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:18:51,683][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:18:52,421][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:18:52,422][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:18:52,424][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:18:54,381][__main__][INFO] - Iteration 41 took 19s (22.36% Gen, 67.59% Train). Generation: 4s, Training: 13s. Estimated remaining time: 16h 0m 8s. Estimated total time: 16h 15m 13s. Time estimates for 10 more iterations: 3m 15s, 100 more iterations: 32m 30s, 500 more iterations: 2h 42m 32s.
[2025-11-13 08:18:54,383][__main__][INFO] - Starting iteration 41.
[2025-11-13 08:18:54,386][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1.
[2025-11-13 08:18:54,386][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:18:59,343][__main__][INFO] - Number of regex retries in iteration 41: 0
[2025-11-13 08:18:59,343][__main__][INFO] - agents played in iteration 41 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:18:59,790][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:18:59,830][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:18:59,870][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:18:59,911][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:18:59,912][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:18:59,912][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:19:00,647][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:19:00,946][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:19:01,274][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:19:01,602][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:19:01,929][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:19:02,256][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:19:02,583][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:19:02,914][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:19:03,243][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:19:03,570][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:19:03,899][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:19:04,226][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:19:04,553][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:19:04,880][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:19:05,206][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:19:05,536][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:19:05,864][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:19:06,192][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:19:06,518][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:19:06,848][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:19:07,175][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:19:07,505][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:19:07,833][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:19:08,161][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:19:08,488][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:19:08,815][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:19:09,143][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:19:09,473][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:19:09,802][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:19:10,132][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:19:10,460][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:19:10,790][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:19:11,121][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:19:11,830][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:19:12,566][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:19:12,567][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:19:12,569][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:19:13,590][__main__][INFO] - Iteration 42 took 19s (25.81% Gen, 68.86% Train). Generation: 4s, Training: 13s. Estimated remaining time: 15h 44m 50s. Estimated total time: 16h 0m 15s. Time estimates for 10 more iterations: 3m 12s, 100 more iterations: 32m 0s, 500 more iterations: 2h 40m 2s.
[2025-11-13 08:19:13,592][__main__][INFO] - Starting iteration 42.
[2025-11-13 08:19:13,595][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1.
[2025-11-13 08:19:13,595][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:19:18,405][__main__][INFO] - Number of regex retries in iteration 42: 0
[2025-11-13 08:19:18,406][__main__][INFO] - agents played in iteration 42 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:19:18,839][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:19:18,880][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:19:18,921][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:19:18,963][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:19:18,963][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:19:18,963][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:19:19,682][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:19:19,981][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:19:20,309][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:19:20,636][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:19:20,965][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:19:21,293][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:19:21,620][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:19:21,950][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:19:22,278][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:19:22,604][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:19:22,932][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:19:23,259][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:19:23,586][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:19:23,913][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:19:24,243][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:19:24,571][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:19:24,899][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:19:25,226][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:19:25,553][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:19:25,880][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:19:26,211][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:19:26,538][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:19:26,869][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:19:27,199][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:19:27,527][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:19:27,853][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:19:28,180][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:19:28,507][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:19:28,837][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:19:29,164][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:19:29,491][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:19:29,818][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:19:30,146][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:19:30,850][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:19:31,577][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:19:31,579][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:19:31,580][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:19:32,599][__main__][INFO] - Iteration 43 took 19s (25.31% Gen, 69.32% Train). Generation: 4s, Training: 13s. Estimated remaining time: 15h 34m 33s. Estimated total time: 15h 50m 17s. Time estimates for 10 more iterations: 3m 10s, 100 more iterations: 31m 40s, 500 more iterations: 2h 38m 22s.
[2025-11-13 08:19:32,607][__main__][INFO] - Starting iteration 43.
[2025-11-13 08:19:32,610][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1.
[2025-11-13 08:19:32,611][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:19:37,412][__main__][INFO] - Number of regex retries in iteration 43: 0
[2025-11-13 08:19:37,412][__main__][INFO] - agents played in iteration 43 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:19:37,854][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:19:37,897][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:19:37,939][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:19:37,981][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:19:37,981][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:19:37,982][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:19:38,712][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:19:39,015][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:19:39,343][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:19:39,671][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:19:40,001][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:19:40,330][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:19:40,658][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:19:40,984][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:19:41,312][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:19:41,640][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:19:41,968][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:19:42,301][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:19:42,629][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:19:42,956][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:19:43,283][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:19:43,611][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:19:43,942][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:19:44,268][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:19:44,595][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:19:44,924][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:19:45,251][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:19:45,578][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:19:45,905][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:19:46,231][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:19:46,559][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:19:46,886][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:19:47,213][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:19:47,540][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:19:47,870][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:19:48,197][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:19:48,524][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:19:48,852][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:19:49,180][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:19:49,893][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:19:50,633][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:19:50,634][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:19:50,636][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:19:51,783][__main__][INFO] - Iteration 44 took 19s (25.04% Gen, 68.97% Train). Generation: 4s, Training: 13s. Estimated remaining time: 15h 42m 38s. Estimated total time: 15h 58m 41s. Time estimates for 10 more iterations: 3m 11s, 100 more iterations: 31m 57s, 500 more iterations: 2h 39m 46s.
[2025-11-13 08:19:51,786][__main__][INFO] - Starting iteration 44.
[2025-11-13 08:19:51,789][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1.
[2025-11-13 08:19:51,790][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:19:56,756][__main__][INFO] - Number of regex retries in iteration 44: 0
[2025-11-13 08:19:56,756][__main__][INFO] - agents played in iteration 44 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:19:57,196][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:19:57,238][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:19:57,279][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:19:57,320][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:19:57,321][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:19:57,321][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:19:58,046][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:19:58,345][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:19:58,675][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:19:59,006][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:19:59,333][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:19:59,660][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:19:59,988][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:20:00,315][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:20:00,642][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:20:00,973][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:20:01,303][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:20:01,632][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:20:01,960][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:20:02,289][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:20:02,617][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:20:02,944][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:20:03,271][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:20:03,599][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:20:03,927][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:20:04,255][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:20:04,583][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:20:04,912][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:20:05,239][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:20:05,569][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:20:05,899][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:20:06,229][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:20:06,557][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:20:06,884][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:20:07,212][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:20:07,539][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:20:07,872][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:20:08,200][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:20:08,528][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:20:09,234][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:20:09,983][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:20:09,984][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:20:09,986][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:20:11,007][__main__][INFO] - Iteration 45 took 19s (25.84% Gen, 68.84% Train). Generation: 4s, Training: 13s. Estimated remaining time: 15h 44m 34s. Estimated total time: 16h 0m 56s. Time estimates for 10 more iterations: 3m 12s, 100 more iterations: 32m 1s, 500 more iterations: 2h 40m 9s.
[2025-11-13 08:20:11,009][__main__][INFO] - Starting iteration 45.
[2025-11-13 08:20:11,012][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1.
[2025-11-13 08:20:11,012][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:20:15,859][__main__][INFO] - Number of regex retries in iteration 45: 0
[2025-11-13 08:20:15,860][__main__][INFO] - agents played in iteration 45 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:20:16,295][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:20:16,337][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:20:16,379][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:20:16,420][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:20:16,420][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:20:16,421][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:20:17,162][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:20:17,460][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:20:17,790][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:20:18,120][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:20:18,450][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:20:18,782][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:20:19,111][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:20:19,446][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:20:19,776][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:20:20,103][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:20:20,431][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:20:20,761][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:20:21,093][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:20:21,427][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:20:21,755][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:20:22,085][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:20:22,413][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:20:22,743][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:20:23,077][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:20:23,410][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:20:23,738][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:20:24,070][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:20:24,398][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:20:24,729][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:20:25,057][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:20:25,392][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:20:25,727][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:20:26,059][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:20:26,386][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:20:26,716][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:20:27,050][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:20:27,382][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:20:27,718][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:20:28,445][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:20:29,189][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:20:29,191][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:20:29,192][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:20:30,127][__main__][INFO] - Iteration 46 took 19s (25.36% Gen, 69.74% Train). Generation: 4s, Training: 13s. Estimated remaining time: 15h 39m 8s. Estimated total time: 15h 55m 49s. Time estimates for 10 more iterations: 3m 11s, 100 more iterations: 31m 51s, 500 more iterations: 2h 39m 18s.
[2025-11-13 08:20:30,133][__main__][INFO] - Starting iteration 46.
[2025-11-13 08:20:30,136][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1.
[2025-11-13 08:20:30,136][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:20:34,931][__main__][INFO] - Number of regex retries in iteration 46: 0
[2025-11-13 08:20:34,932][__main__][INFO] - agents played in iteration 46 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:20:35,372][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:20:35,412][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:20:35,452][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:20:35,492][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:20:35,493][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:20:35,493][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:20:36,208][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:20:36,507][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:20:36,834][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:20:37,163][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:20:37,498][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:20:37,829][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:20:38,159][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:20:38,488][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:20:38,817][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:20:39,145][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:20:39,471][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:20:39,799][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:20:40,127][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:20:40,455][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:20:40,782][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:20:41,110][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:20:41,441][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:20:41,767][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:20:42,095][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:20:42,421][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:20:42,748][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:20:43,076][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:20:43,405][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:20:43,733][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:20:44,061][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:20:44,389][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:20:44,716][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:20:45,044][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:20:45,373][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:20:45,700][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:20:46,027][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:20:46,356][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:20:46,685][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:20:47,383][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:20:48,118][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:20:48,120][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:20:48,121][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:20:49,165][__main__][INFO] - Iteration 47 took 19s (25.20% Gen, 69.31% Train). Generation: 4s, Training: 13s. Estimated remaining time: 15h 34m 31s. Estimated total time: 15h 51m 31s. Time estimates for 10 more iterations: 3m 10s, 100 more iterations: 31m 43s, 500 more iterations: 2h 38m 35s.
[2025-11-13 08:20:49,168][__main__][INFO] - Starting iteration 47.
[2025-11-13 08:20:49,171][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1.
[2025-11-13 08:20:49,171][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:20:53,950][__main__][INFO] - Number of regex retries in iteration 47: 0
[2025-11-13 08:20:53,951][__main__][INFO] - agents played in iteration 47 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:20:54,394][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:20:54,435][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:20:54,476][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:20:54,517][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:20:54,518][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:20:54,518][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:20:55,245][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:20:55,545][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:20:55,873][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:20:56,202][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:20:56,530][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:20:56,859][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:20:57,188][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:20:57,516][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:20:57,843][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:20:58,171][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:20:58,499][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:20:58,826][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:20:59,153][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:20:59,481][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:20:59,809][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:21:00,137][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:21:00,464][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:21:00,792][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:21:01,122][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:21:01,449][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:21:01,776][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:21:02,103][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:21:02,431][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:21:02,759][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:21:03,086][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:21:03,417][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:21:03,746][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:21:04,073][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:21:04,402][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:21:04,730][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:21:05,057][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:21:05,384][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:21:05,713][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:21:06,424][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:21:07,180][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:21:07,182][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:21:07,183][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:21:08,226][__main__][INFO] - Iteration 48 took 19s (25.08% Gen, 69.44% Train). Generation: 4s, Training: 13s. Estimated remaining time: 15h 35m 28s. Estimated total time: 15h 52m 47s. Time estimates for 10 more iterations: 3m 10s, 100 more iterations: 31m 45s, 500 more iterations: 2h 38m 47s.
[2025-11-13 08:21:08,229][__main__][INFO] - Starting iteration 48.
[2025-11-13 08:21:08,231][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1.
[2025-11-13 08:21:08,232][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:21:13,051][__main__][INFO] - Number of regex retries in iteration 48: 0
[2025-11-13 08:21:13,052][__main__][INFO] - agents played in iteration 48 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:21:13,491][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:21:13,532][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:21:13,573][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:21:13,615][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:21:13,616][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:21:13,616][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:21:14,342][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:21:14,640][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:21:14,968][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:21:15,302][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:21:15,632][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:21:15,961][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:21:16,289][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:21:16,618][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:21:16,947][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:21:17,274][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:21:17,602][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:21:17,930][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:21:18,257][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:21:18,584][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:21:18,916][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:21:19,245][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:21:19,575][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:21:19,903][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:21:20,230][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:21:20,558][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:21:20,886][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:21:21,214][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:21:21,542][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:21:21,870][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:21:22,198][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:21:22,525][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:21:22,851][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:21:23,179][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:21:23,506][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:21:23,834][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:21:24,163][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:21:24,494][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:21:24,821][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:21:25,548][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:21:26,321][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:21:26,323][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:21:26,324][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:21:27,379][__main__][INFO] - Iteration 49 took 19s (25.17% Gen, 69.31% Train). Generation: 4s, Training: 13s. Estimated remaining time: 15h 39m 48s. Estimated total time: 15h 57m 26s. Time estimates for 10 more iterations: 3m 11s, 100 more iterations: 31m 54s, 500 more iterations: 2h 39m 34s.
[2025-11-13 08:21:27,382][__main__][INFO] - Starting iteration 49.
[2025-11-13 08:21:27,385][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1.
[2025-11-13 08:21:27,386][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:21:32,151][__main__][INFO] - Number of regex retries in iteration 49: 0 [2025-11-13 08:21:32,152][__main__][INFO] - agents played in iteration 49 are Alice_buffer, Alice, Bob, Bob_buffer [2025-11-13 08:21:32,587][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:21:32,629][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:21:32,671][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:21:32,712][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:21:32,713][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:21:32,713][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:21:33,425][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:21:33,724][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:21:34,053][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:21:34,381][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:21:34,708][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:21:35,036][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:21:35,363][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:21:35,690][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:21:36,017][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:21:36,347][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:21:36,674][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:21:37,001][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:21:37,329][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:21:37,657][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:21:37,987][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:21:38,314][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:21:38,642][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:21:38,972][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:21:39,306][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:21:39,638][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:21:39,972][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 08:21:40,302][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:21:40,631][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:21:40,957][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:21:41,283][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:21:41,613][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:21:41,939][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:21:42,266][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:21:42,594][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:21:42,921][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:21:43,249][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:21:43,575][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:21:43,903][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 08:21:44,604][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:21:45,338][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:21:45,339][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:21:45,342][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:21:46,356][__main__][INFO] - Iteration 50 took 18s (25.12% Gen, 69.52% Train). Generation: 4s, Training: 13s. Estimated remaining time: 15h 30m 36s. Estimated total time: 15h 48m 34s. Time estimates for 10 more iterations: 3m 9s, 100 more iterations: 31m 37s, 500 more iterations: 2h 38m 5s.
[2025-11-13 08:21:46,358][__main__][INFO] - Starting iteration 50.
[2025-11-13 08:21:46,361][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1.
[2025-11-13 08:21:46,362][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:21:51,219][__main__][INFO] - Number of regex retries in iteration 50: 0
[2025-11-13 08:21:51,220][__main__][INFO] - agents played in iteration 50 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:21:51,654][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:21:51,696][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:21:51,737][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:21:51,778][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:21:51,779][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:21:51,780][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:21:52,500][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:21:52,800][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:21:53,129][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:21:53,457][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:21:53,788][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:21:54,115][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:21:54,443][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:21:54,774][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:21:55,102][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:21:55,430][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:21:55,757][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:21:56,088][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:21:56,419][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:21:56,753][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:21:57,081][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:21:57,412][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:21:57,740][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:21:58,074][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:21:58,401][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:21:58,731][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:21:59,058][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:21:59,393][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:21:59,724][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:22:00,055][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:22:00,383][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:22:00,715][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:22:01,050][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:22:01,385][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:22:01,713][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:22:02,042][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:22:02,371][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:22:02,703][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:22:03,031][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:22:03,746][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:22:04,482][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:22:04,484][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:22:04,485][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:22:06,481][__main__][INFO] - Iteration 51 took 20s (24.14% Gen, 65.93% Train). Generation: 4s, Training: 13s. Estimated remaining time: 16h 27m 45s. Estimated total time: 16h 46m 2s. Time estimates for 10 more iterations: 3m 21s, 100 more iterations: 33m 32s, 500 more iterations: 2h 47m 40s.
[2025-11-13 08:22:06,484][__main__][INFO] - Starting iteration 51.
[2025-11-13 08:22:06,487][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1.
[2025-11-13 08:22:06,487][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:22:11,794][__main__][INFO] - Number of regex retries in iteration 51: 0
[2025-11-13 08:22:11,795][__main__][INFO] - agents played in iteration 51 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:22:12,236][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:22:12,277][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:22:12,318][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:22:12,360][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:22:12,360][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:22:12,361][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:22:13,091][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:22:13,391][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:22:13,720][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:22:14,051][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:22:14,381][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:22:14,716][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:22:15,051][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:22:15,382][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:22:15,716][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:22:16,048][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:22:16,384][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:22:16,713][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:22:17,042][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:22:17,370][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:22:17,698][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:22:18,026][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:22:18,354][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:22:18,681][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:22:19,009][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:22:19,337][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:22:19,665][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:22:19,993][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:22:20,321][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:22:20,649][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:22:20,977][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:22:21,304][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:22:21,633][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:22:21,960][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:22:22,289][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:22:22,617][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:22:22,945][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:22:23,272][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:22:23,601][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:22:24,324][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:22:25,237][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:22:25,239][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:22:25,251][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:22:26,478][__main__][INFO] - Iteration 52 took 19s (26.55% Gen, 67.31% Train). Generation: 5s, Training: 13s. Estimated remaining time: 16h 20m 57s. Estimated total time: 16h 39m 35s. Time estimates for 10 more iterations: 3m 19s, 100 more iterations: 33m 19s, 500 more iterations: 2h 46m 35s.
[2025-11-13 08:22:26,480][__main__][INFO] - Starting iteration 52.
[2025-11-13 08:22:26,482][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1.
[2025-11-13 08:22:26,483][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:22:31,497][__main__][INFO] - Number of regex retries in iteration 52: 0
[2025-11-13 08:22:31,498][__main__][INFO] - agents played in iteration 52 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:22:31,927][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:22:31,969][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:22:32,008][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:22:32,049][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:22:32,049][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:22:32,050][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:22:32,762][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:22:33,060][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:22:33,388][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:22:33,715][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:22:34,042][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:22:34,370][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:22:34,705][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:22:35,036][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:22:35,364][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:22:35,693][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:22:36,022][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:22:36,348][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:22:36,679][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:22:37,009][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:22:37,339][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:22:37,669][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:22:37,998][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:22:38,328][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:22:38,660][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:22:38,995][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:22:39,325][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:22:39,655][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:22:39,983][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:22:40,310][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:22:40,640][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:22:40,968][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:22:41,297][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:22:41,626][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:22:41,954][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:22:42,282][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:22:42,610][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:22:42,938][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:22:43,267][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:22:43,969][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:22:44,700][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:22:44,701][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:22:44,703][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:22:45,715][__main__][INFO] - Iteration 53 took 19s (26.07% Gen, 68.66% Train). Generation: 5s, Training: 13s. Estimated remaining time: 15h 42m 43s. Estimated total time: 16h 1m 39s. Time estimates for 10 more iterations: 3m 12s, 100 more iterations: 32m 3s, 500 more iterations: 2h 40m 16s.
[2025-11-13 08:22:45,717][__main__][INFO] - Starting iteration 53.
[2025-11-13 08:22:45,720][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1.
[2025-11-13 08:22:45,720][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:22:50,880][__main__][INFO] - Number of regex retries in iteration 53: 0
[2025-11-13 08:22:50,881][__main__][INFO] - agents played in iteration 53 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:22:51,314][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:22:51,356][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:22:51,397][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:22:51,438][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:22:51,438][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:22:51,439][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:22:52,183][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:22:52,482][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:22:52,811][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:22:53,140][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:22:53,469][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:22:53,797][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:22:54,125][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:22:54,452][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:22:54,779][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:22:55,107][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:22:55,435][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:22:55,763][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:22:56,093][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:22:56,420][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:22:56,748][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:22:57,076][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:22:57,404][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:22:57,733][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:22:58,067][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:22:58,402][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:22:58,735][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:22:59,063][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:22:59,394][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:22:59,726][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:23:00,056][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:23:00,386][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:23:00,717][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:23:01,045][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:23:01,373][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:23:01,702][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:23:02,033][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:23:02,364][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:23:02,695][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:23:03,396][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:23:04,140][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:23:04,142][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:23:04,143][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:23:05,180][__main__][INFO] - Iteration 54 took 19s (26.52% Gen, 68.15% Train). Generation: 5s, Training: 13s. Estimated remaining time: 15h 53m 47s. Estimated total time: 16h 13m 3s. Time estimates for 10 more iterations: 3m 14s, 100 more iterations: 32m 26s, 500 more iterations: 2h 42m 10s.
[2025-11-13 08:23:05,182][__main__][INFO] - Starting iteration 54.
[2025-11-13 08:23:05,185][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1.
[2025-11-13 08:23:05,186][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:23:10,393][__main__][INFO] - Number of regex retries in iteration 54: 0
[2025-11-13 08:23:10,393][__main__][INFO] - agents played in iteration 54 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:23:10,827][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:23:10,869][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:23:10,909][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:23:10,950][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:23:10,950][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:23:10,950][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:23:11,683][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:23:11,981][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:23:12,310][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:23:12,640][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:23:12,969][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:23:13,296][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:23:13,624][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:23:13,952][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:23:14,280][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:23:14,609][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:23:14,937][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:23:15,265][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:23:15,594][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:23:15,924][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:23:16,253][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:23:16,584][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:23:16,918][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:23:17,245][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:23:17,574][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:23:17,901][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:23:18,229][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:23:18,557][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:23:18,887][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:23:19,214][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:23:19,542][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:23:19,872][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:23:20,198][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:23:20,529][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:23:20,857][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:23:21,191][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:23:21,522][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:23:21,849][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:23:22,179][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:23:22,884][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:23:23,633][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:23:23,634][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:23:23,636][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:23:24,646][__main__][INFO] - Iteration 55 took 19s (26.75% Gen, 68.04% Train). Generation: 5s, Training: 13s. Estimated remaining time: 15h 53m 29s. Estimated total time: 16h 13m 4s. Time estimates for 10 more iterations: 3m 14s, 100 more iterations: 32m 26s, 500 more iterations: 2h 42m 10s.
[2025-11-13 08:23:24,648][__main__][INFO] - Starting iteration 55.
[2025-11-13 08:23:24,651][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1.
[2025-11-13 08:23:24,651][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:23:29,884][__main__][INFO] - Number of regex retries in iteration 55: 0
[2025-11-13 08:23:29,885][__main__][INFO] - agents played in iteration 55 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:23:30,324][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:23:30,365][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:23:30,405][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:23:30,445][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:23:30,445][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:23:30,446][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:23:31,171][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:23:31,470][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:23:31,801][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:23:32,131][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:23:32,459][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:23:32,794][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:23:33,124][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:23:33,451][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:23:33,783][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:23:34,112][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:23:34,447][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:23:34,779][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:23:35,109][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:23:35,438][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:23:35,769][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:23:36,100][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:23:36,437][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:23:36,773][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:23:37,109][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:23:37,437][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:23:37,768][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:23:38,097][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:23:38,426][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:23:38,753][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:23:39,084][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:23:39,414][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:23:39,748][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:23:40,075][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:23:40,403][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:23:40,738][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:23:41,073][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:23:41,400][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:23:41,730][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:23:42,441][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:23:43,181][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:23:43,182][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:23:43,184][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:23:44,365][__main__][INFO] - Iteration 56 took 19s (26.54% Gen, 67.46% Train). Generation: 5s, Training: 13s. Estimated remaining time: 16h 5m 50s. Estimated total time: 16h 25m 45s. Time estimates for 10 more iterations: 3m 17s, 100 more iterations: 32m 51s, 500 more iterations: 2h 44m 17s.
[2025-11-13 08:23:44,367][__main__][INFO] - Starting iteration 56.
[2025-11-13 08:23:44,370][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1.
[2025-11-13 08:23:44,370][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:23:49,358][__main__][INFO] - Number of regex retries in iteration 56: 0
[2025-11-13 08:23:49,359][__main__][INFO] - agents played in iteration 56 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:23:49,794][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:23:49,835][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:23:49,876][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:23:49,917][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:23:49,918][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:23:49,918][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:23:50,635][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:23:50,934][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:23:51,268][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:23:51,598][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:23:51,926][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:23:52,254][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:23:52,581][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:23:52,910][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:23:53,237][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:23:53,565][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:23:53,893][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:23:54,219][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:23:54,547][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:23:54,875][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:23:55,202][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:23:55,530][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:23:55,859][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:23:56,186][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:23:56,513][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:23:56,840][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:23:57,168][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:23:57,496][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:23:57,825][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:23:58,153][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:23:58,480][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:23:58,808][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:23:59,134][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:23:59,463][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:23:59,791][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:24:00,117][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:24:00,444][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:24:00,773][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:24:01,102][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:24:01,801][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:24:02,530][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:24:02,532][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:24:02,533][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:24:03,542][__main__][INFO] - Iteration 57 took 19s (26.02% Gen, 68.71% Train). Generation: 4s, Training: 13s. Estimated remaining time: 15h 38m 26s. Estimated total time: 15h 58m 40s. Time estimates for 10 more iterations: 3m 11s, 100 more iterations: 31m 57s, 500 more iterations: 2h 39m 46s.
[2025-11-13 08:24:03,544][__main__][INFO] - Starting iteration 57.
[2025-11-13 08:24:03,547][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1.
[2025-11-13 08:24:03,548][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:24:08,544][__main__][INFO] - Number of regex retries in iteration 57: 0
[2025-11-13 08:24:08,544][__main__][INFO] - agents played in iteration 57 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:24:08,984][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:24:09,027][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:24:09,068][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:24:09,109][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:24:09,109][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:24:09,110][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:24:09,847][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:24:10,146][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:24:10,476][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:24:10,810][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:24:11,141][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:24:11,469][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:24:11,796][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:24:12,124][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:24:12,450][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:24:12,778][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:24:13,105][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:24:13,432][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:24:13,761][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:24:14,088][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:24:14,420][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:24:14,756][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:24:15,087][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:24:15,417][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:24:15,752][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:24:16,082][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:24:16,416][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:24:16,746][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:24:17,074][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:24:17,402][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:24:17,730][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:24:18,058][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:24:18,385][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:24:18,712][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:24:19,041][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:24:19,369][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:24:19,698][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:24:20,025][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:24:20,353][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:24:21,069][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:24:21,824][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:24:21,826][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:24:21,828][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:24:22,862][__main__][INFO] - Iteration 58 took 19s (25.87% Gen, 68.77% Train). Generation: 4s, Training: 13s. Estimated remaining time: 15h 45m 15s. Estimated total time: 16h 5m 48s. Time estimates for 10 more iterations: 3m 13s, 100 more iterations: 32m 11s, 500 more iterations: 2h 40m 58s.
[2025-11-13 08:24:22,864][__main__][INFO] - Starting iteration 58.
[2025-11-13 08:24:22,867][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1.
[2025-11-13 08:24:22,867][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:24:28,082][__main__][INFO] - Number of regex retries in iteration 58: 0
[2025-11-13 08:24:28,083][__main__][INFO] - agents played in iteration 58 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:24:28,521][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:24:28,563][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:24:28,603][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:24:28,643][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:24:28,644][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:24:28,644][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:24:29,371][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:24:29,668][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:24:29,997][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:24:30,327][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:24:30,656][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:24:30,984][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:24:31,312][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:24:31,644][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:24:31,975][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:24:32,306][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:24:32,636][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:24:32,964][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:24:33,292][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:24:33,621][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:24:33,951][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:24:34,279][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:24:34,608][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:24:34,936][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:24:35,267][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:24:35,598][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:24:35,930][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:24:36,260][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:24:36,591][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:24:36,920][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:24:37,248][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:24:37,576][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:24:37,904][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:24:38,232][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:24:38,561][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:24:38,895][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:24:39,226][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:24:39,555][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:24:39,886][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:24:40,677][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:24:41,434][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:24:41,435][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:24:41,437][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:24:42,437][__main__][INFO] - Iteration 59 took 19s (26.65% Gen, 68.23% Train). Generation: 5s, Training: 13s. Estimated remaining time: 15h 57m 40s. Estimated total time: 16h 18m 33s. Time estimates for 10 more iterations: 3m 15s, 100 more iterations: 32m 37s, 500 more iterations: 2h 43m 5s.
[2025-11-13 08:24:42,439][__main__][INFO] - Starting iteration 59.
[2025-11-13 08:24:42,442][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1.
[2025-11-13 08:24:42,442][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:24:47,482][__main__][INFO] - Number of regex retries in iteration 59: 0
[2025-11-13 08:24:47,483][__main__][INFO] - agents played in iteration 59 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:24:47,916][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:24:47,957][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:24:47,998][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:24:48,039][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:24:48,039][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:24:48,040][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:24:48,779][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:24:49,078][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:24:49,406][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:24:49,734][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:24:50,062][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:24:50,391][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:24:50,718][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:24:51,047][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:24:51,375][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:24:51,704][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:24:52,033][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:24:52,362][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:24:52,690][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:24:53,017][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:24:53,345][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:24:53,673][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:24:54,001][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:24:54,328][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:24:54,657][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:24:54,984][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:24:55,313][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:24:55,643][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:24:55,972][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:24:56,300][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:24:56,628][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:24:56,956][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:24:57,287][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:24:57,616][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:24:57,943][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:24:58,271][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:24:58,599][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:24:58,927][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:24:59,255][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:24:59,973][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:25:00,711][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:25:00,713][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:25:00,715][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:25:01,747][__main__][INFO] - Iteration 60 took 19s (26.11% Gen, 68.54% Train). Generation: 5s, Training: 13s. Estimated remaining time: 15h 44m 5s. Estimated total time: 16h 5m 18s. Time estimates for 10 more iterations: 3m 13s, 100 more iterations: 32m 10s, 500 more iterations: 2h 40m 53s.
[2025-11-13 08:25:01,749][__main__][INFO] - Starting iteration 60.
[2025-11-13 08:25:01,752][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1.
[2025-11-13 08:25:01,753][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:25:06,823][__main__][INFO] - Number of regex retries in iteration 60: 0
[2025-11-13 08:25:06,824][__main__][INFO] - agents played in iteration 60 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:25:07,270][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:25:07,310][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:25:07,349][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:25:07,389][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:25:07,390][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:25:07,390][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:25:08,117][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:25:08,414][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:25:08,743][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:25:09,071][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:25:09,398][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:25:09,727][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:25:10,061][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:25:10,389][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:25:10,719][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:25:11,047][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:25:11,378][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:25:11,705][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:25:12,035][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:25:12,369][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:25:12,695][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:25:13,023][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:25:13,357][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:25:13,690][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:25:14,018][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:25:14,348][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:25:14,678][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:25:15,008][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:25:15,339][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:25:15,670][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:25:16,007][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:25:16,342][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:25:16,670][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:25:17,004][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:25:17,339][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:25:17,670][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:25:17,997][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:25:18,326][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:25:18,656][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:25:19,387][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:25:20,133][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:25:20,134][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:25:20,136][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:25:22,141][__main__][INFO] - Iteration 61 took 20s (24.87% Gen, 65.29% Train). Generation: 5s, Training: 13s. Estimated remaining time: 16h 37m 55s. Estimated total time: 16h 59m 28s. Time estimates for 10 more iterations: 3m 23s, 100 more iterations: 33m 58s, 500 more iterations: 2h 49m 54s.
[2025-11-13 08:25:22,143][__main__][INFO] - Starting iteration 61.
[2025-11-13 08:25:22,146][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1.
[2025-11-13 08:25:22,146][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:25:27,706][__main__][INFO] - Number of regex retries in iteration 61: 0
[2025-11-13 08:25:27,707][__main__][INFO] - agents played in iteration 61 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:25:28,142][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:25:28,183][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:25:28,223][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:25:28,263][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:25:28,263][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:25:28,264][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:25:28,985][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:25:29,284][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:25:29,612][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:25:29,941][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:25:30,270][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:25:30,597][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:25:30,926][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:25:31,253][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:25:31,584][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:25:31,912][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:25:32,240][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:25:32,574][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:25:32,904][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:25:33,238][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:25:33,568][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:25:33,902][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:25:34,229][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:25:34,558][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:25:34,887][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:25:35,221][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:25:35,556][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:25:35,890][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:25:36,217][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:25:36,546][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:25:36,874][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:25:37,201][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:25:37,529][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:25:37,857][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:25:38,185][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:25:38,512][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:25:38,840][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:25:39,168][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:25:39,496][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:25:40,203][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:25:40,933][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:25:40,934][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:25:40,936][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:25:42,040][__main__][INFO] - Iteration 62 took 19s (27.95% Gen, 66.49% Train). Generation: 5s, Training: 13s. Estimated remaining time: 16h 12m 54s. Estimated total time: 16h 34m 47s. Time estimates for 10 more iterations: 3m 18s, 100 more iterations: 33m 9s, 500 more iterations: 2h 45m 47s.
[2025-11-13 08:25:42,043][__main__][INFO] - Starting iteration 62.
[2025-11-13 08:25:42,048][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1.
[2025-11-13 08:25:42,049][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:25:47,429][__main__][INFO] - Number of regex retries in iteration 62: 0
[2025-11-13 08:25:47,430][__main__][INFO] - agents played in iteration 62 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:25:47,865][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:25:47,905][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:25:47,945][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:25:47,985][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:25:47,985][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:25:47,986][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:25:48,707][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:25:49,005][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:25:49,339][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:25:49,667][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:25:49,996][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:25:50,322][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:25:50,649][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:25:50,984][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:25:51,314][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:25:51,643][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:25:51,970][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:25:52,302][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:25:52,629][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:25:52,956][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:25:53,283][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:25:53,611][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:25:53,939][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:25:54,266][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:25:54,593][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:25:54,922][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:25:55,250][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:25:55,578][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:25:55,906][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:25:56,233][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:25:56,561][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:25:56,891][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:25:57,217][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:25:57,545][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:25:57,873][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:25:58,203][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:25:58,531][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:25:58,859][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:25:59,188][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:25:59,905][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:26:00,635][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:26:00,636][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:26:00,638][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:26:01,612][__main__][INFO] - Iteration 63 took 19s (27.50% Gen, 67.51% Train). Generation: 5s, Training: 13s. Estimated remaining time: 15h 56m 0s. Estimated total time: 16h 18m 13s. Time estimates for 10 more iterations: 3m 15s, 100 more iterations: 32m 36s, 500 more iterations: 2h 43m 2s.
[2025-11-13 08:26:01,614][__main__][INFO] - Starting iteration 63.
[2025-11-13 08:26:01,617][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1.
[2025-11-13 08:26:01,617][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:26:07,017][__main__][INFO] - Number of regex retries in iteration 63: 0
[2025-11-13 08:26:07,017][__main__][INFO] - agents played in iteration 63 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:26:07,456][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:26:07,496][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:26:07,536][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:26:07,576][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:26:07,576][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:26:07,577][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:26:08,320][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:26:08,617][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:26:08,947][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:26:09,279][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:26:09,610][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:26:09,938][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:26:10,266][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:26:10,597][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:26:10,927][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:26:11,255][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:26:11,586][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:26:11,915][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:26:12,243][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:26:12,572][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:26:12,903][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:26:13,232][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:26:13,560][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:26:13,896][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:26:14,231][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:26:14,560][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:26:14,890][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:26:15,220][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:26:15,547][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:26:15,874][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:26:16,202][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:26:16,530][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:26:16,858][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:26:17,186][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:26:17,514][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:26:17,842][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:26:18,171][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:26:18,502][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:26:18,830][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:26:19,542][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:26:20,282][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:26:20,283][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:26:20,285][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:26:21,254][__main__][INFO] - Iteration 64 took 19s (27.50% Gen, 67.56% Train). Generation: 5s, Training: 13s. Estimated remaining time: 15h 59m 22s. Estimated total time: 16h 21m 54s. Time estimates for 10 more iterations: 3m 16s, 100 more iterations: 32m 43s, 500 more iterations: 2h 43m 39s.
[2025-11-13 08:26:21,257][__main__][INFO] - Starting iteration 64.
[2025-11-13 08:26:21,260][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1.
[2025-11-13 08:26:21,260][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:26:26,828][__main__][INFO] - Number of regex retries in iteration 64: 0
[2025-11-13 08:26:26,829][__main__][INFO] - agents played in iteration 64 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:26:27,267][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:26:27,308][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:26:27,348][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:26:27,389][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:26:27,389][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:26:27,389][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:26:28,124][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:26:28,422][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:26:28,751][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:26:29,078][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:26:29,406][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:26:29,734][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:26:30,062][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:26:30,390][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:26:30,717][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:26:31,045][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:26:31,373][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:26:31,701][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:26:32,029][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:26:32,357][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:26:32,685][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:26:33,013][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:26:33,340][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:26:33,668][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:26:33,997][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:26:34,328][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:26:34,655][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:26:34,984][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:26:35,314][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:26:35,643][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:26:35,971][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:26:36,300][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:26:36,628][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:26:36,956][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:26:37,286][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:26:37,615][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:26:37,942][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:26:38,274][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:26:38,601][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:26:39,319][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:26:40,058][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:26:40,060][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:26:40,061][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:26:41,326][__main__][INFO] - Iteration 65 took 20s (27.75% Gen, 65.94% Train). Generation: 5s, Training: 13s. Estimated remaining time: 16h 20m 29s. Estimated total time: 16h 43m 22s. Time estimates for 10 more iterations: 3m 20s, 100 more iterations: 33m 26s, 500 more iterations: 2h 47m 13s.
[2025-11-13 08:26:41,328][__main__][INFO] - Starting iteration 65.
[2025-11-13 08:26:41,331][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1.
[2025-11-13 08:26:41,331][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:26:46,620][__main__][INFO] - Number of regex retries in iteration 65: 0
[2025-11-13 08:26:46,621][__main__][INFO] - agents played in iteration 65 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:26:47,056][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:26:47,096][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:26:47,136][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:26:47,177][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:26:47,177][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:26:47,178][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:26:47,894][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:26:48,192][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:26:48,521][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:26:48,853][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:26:49,180][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:26:49,508][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:26:49,836][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:26:50,164][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:26:50,492][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:26:50,821][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:26:51,148][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:26:51,476][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:26:51,804][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:26:52,132][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:26:52,462][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:26:52,797][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:26:53,127][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:26:53,456][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:26:53,784][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:26:54,113][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:26:54,443][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:26:54,774][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:26:55,102][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:26:55,429][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:26:55,757][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:26:56,084][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:26:56,413][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:26:56,743][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:26:57,074][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:26:57,403][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:26:57,732][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:26:58,060][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:26:58,389][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:26:59,097][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:26:59,828][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:26:59,829][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:26:59,831][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:27:00,832][__main__][INFO] - Iteration 66 took 19s (27.12% Gen, 67.74% Train). Generation: 5s, Training: 13s. Estimated remaining time: 15h 51m 54s. Estimated total time: 16h 15m 6s. Time estimates for 10 more iterations: 3m 15s, 100 more iterations: 32m 30s, 500 more iterations: 2h 42m 31s.
[2025-11-13 08:27:00,834][__main__][INFO] - Starting iteration 66.
[2025-11-13 08:27:00,837][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1.
[2025-11-13 08:27:00,837][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:27:06,265][__main__][INFO] - Number of regex retries in iteration 66: 0
[2025-11-13 08:27:06,265][__main__][INFO] - agents played in iteration 66 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:27:06,701][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:27:06,744][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:27:06,785][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:27:06,826][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:27:06,826][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:27:06,827][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:27:07,562][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:27:07,860][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:27:08,188][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:27:08,516][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:27:08,844][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:27:09,179][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:27:09,510][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:27:09,839][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:27:10,166][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:27:10,494][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:27:10,822][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:27:11,150][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:27:11,479][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:27:11,807][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:27:12,135][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:27:12,466][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:27:12,794][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:27:13,121][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:27:13,449][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:27:13,777][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:27:14,104][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:27:14,432][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:27:14,761][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:27:15,089][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:27:15,423][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:27:15,759][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:27:16,090][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:27:16,419][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:27:16,748][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:27:17,083][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:27:17,414][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:27:17,742][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:27:18,071][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:27:18,776][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:27:19,513][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:27:19,514][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:27:19,516][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:27:20,657][__main__][INFO] - Iteration 67 took 19s (27.38% Gen, 66.85% Train). Generation: 5s, Training: 13s. Estimated remaining time: 16h 7m 32s. Estimated total time: 16h 31m 3s. Time estimates for 10 more iterations: 3m 18s, 100 more iterations: 33m 2s, 500 more iterations: 2h 45m 10s.
[2025-11-13 08:27:20,659][__main__][INFO] - Starting iteration 67.
[2025-11-13 08:27:20,662][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1.
[2025-11-13 08:27:20,662][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:27:26,036][__main__][INFO] - Number of regex retries in iteration 67: 0
[2025-11-13 08:27:26,036][__main__][INFO] - agents played in iteration 67 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:27:26,468][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:27:26,508][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:27:26,548][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:27:26,588][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:27:26,589][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:27:26,589][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:27:27,314][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:27:27,612][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:27:27,941][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:27:28,270][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:27:28,597][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:27:28,927][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:27:29,255][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:27:29,589][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:27:29,918][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:27:30,245][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:27:30,579][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:27:30,909][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:27:31,241][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:27:31,568][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:27:31,896][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:27:32,224][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:27:32,553][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:27:32,880][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:27:33,209][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:27:33,537][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:27:33,864][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:27:34,192][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:27:34,520][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:27:34,847][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:27:35,174][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:27:35,504][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:27:35,831][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:27:36,165][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:27:36,494][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:27:36,828][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:27:37,158][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:27:37,486][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:27:37,820][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:27:38,530][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:27:39,260][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:27:39,262][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:27:39,264][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:27:40,233][__main__][INFO] - Iteration 68 took 19s (27.46% Gen, 67.58% Train). Generation: 5s, Training: 13s. Estimated remaining time: 15h 54m 45s. Estimated total time: 16h 18m 36s. Time estimates for 10 more iterations: 3m 15s, 100 more iterations: 32m 37s, 500 more iterations: 2h 43m 6s.
[2025-11-13 08:27:40,235][__main__][INFO] - Starting iteration 68.
[2025-11-13 08:27:40,238][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1.
[2025-11-13 08:27:40,238][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:27:45,584][__main__][INFO] - Number of regex retries in iteration 68: 0
[2025-11-13 08:27:45,585][__main__][INFO] - agents played in iteration 68 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:27:46,023][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:27:46,065][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:27:46,106][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:27:46,147][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:27:46,148][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:27:46,148][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:27:46,881][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:27:47,179][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:27:47,512][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:27:47,846][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:27:48,174][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:27:48,505][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:27:48,833][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:27:49,161][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:27:49,489][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:27:49,815][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:27:50,144][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:27:50,473][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:27:50,800][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:27:51,128][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:27:51,459][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:27:51,790][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:27:52,117][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:27:52,445][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:27:52,774][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:27:53,101][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:27:53,429][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:27:53,757][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:27:54,087][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:27:54,414][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:27:54,745][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:27:55,072][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:27:55,400][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:27:55,730][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:27:56,058][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:27:56,389][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:27:56,721][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:27:57,055][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:27:57,384][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:27:58,104][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:27:58,857][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:27:58,859][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:27:58,861][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:28:00,004][__main__][INFO] - Iteration 69 took 19s (27.05% Gen, 67.16% Train). Generation: 5s, Training: 13s. Estimated remaining time: 16h 4m 10s. Estimated total time: 16h 28m 21s. Time estimates for 10 more iterations: 3m 17s, 100 more iterations: 32m 56s, 500 more iterations: 2h 44m 43s.
[2025-11-13 08:28:00,006][__main__][INFO] - Starting iteration 69.
[2025-11-13 08:28:00,009][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1.
[2025-11-13 08:28:00,009][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:28:05,277][__main__][INFO] - Number of regex retries in iteration 69: 0
[2025-11-13 08:28:05,277][__main__][INFO] - agents played in iteration 69 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:28:05,723][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:28:05,765][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:28:05,806][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:28:05,846][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:28:05,847][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:28:05,847][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:28:06,575][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:28:06,874][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:28:07,203][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:28:07,534][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:28:07,868][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:28:08,196][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:28:08,530][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:28:08,858][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:28:09,186][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:28:09,515][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:28:09,849][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:28:10,176][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:28:10,507][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:28:10,838][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:28:11,169][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:28:11,500][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:28:11,830][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:28:12,160][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:28:12,488][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:28:12,816][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:28:13,144][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:28:13,471][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:28:13,802][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:28:14,133][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:28:14,461][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:28:14,788][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:28:15,119][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:28:15,449][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:28:15,780][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:28:16,111][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:28:16,441][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:28:16,770][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:28:17,099][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:28:17,796][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:28:18,530][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:28:18,531][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:28:18,533][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:28:19,656][__main__][INFO] - Iteration 70 took 19s (26.81% Gen, 67.46% Train). Generation: 5s, Training: 13s. Estimated remaining time: 15h 57m 54s. Estimated total time: 16h 22m 24s. Time estimates for 10 more iterations: 3m 16s, 100 more iterations: 32m 44s, 500 more iterations: 2h 43m 44s.
[2025-11-13 08:28:19,658][__main__][INFO] - Starting iteration 70.
[2025-11-13 08:28:19,661][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1.
[2025-11-13 08:28:19,661][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:28:24,953][__main__][INFO] - Number of regex retries in iteration 70: 0
[2025-11-13 08:28:24,953][__main__][INFO] - agents played in iteration 70 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:28:25,391][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:28:25,433][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:28:25,474][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:28:25,515][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:28:25,516][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:28:25,516][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:28:26,248][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:28:26,547][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:28:26,876][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:28:27,204][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:28:27,532][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:28:27,860][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:28:28,190][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:28:28,521][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:28:28,850][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:28:29,179][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:28:29,507][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:28:29,837][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:28:30,166][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:28:30,493][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:28:30,821][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:28:31,149][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:28:31,477][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:28:31,806][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:28:32,136][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:28:32,463][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:28:32,794][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:28:33,129][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:28:33,457][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:28:33,785][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:28:34,120][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:28:34,455][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:28:34,783][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:28:35,112][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:28:35,440][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:28:35,773][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:28:36,113][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:28:36,440][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:28:36,769][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:28:37,493][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:28:38,245][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:28:38,246][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:28:38,248][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:28:40,199][__main__][INFO] - Iteration 71 took 20s (25.76% Gen, 64.73% Train). Generation: 5s, Training: 13s. Estimated remaining time: 16h 42m 6s. Estimated total time: 17h 6m 57s. Time estimates for 10 more iterations: 3m 25s, 100 more iterations: 34m 13s, 500 more iterations: 2h 51m 9s.
[2025-11-13 08:28:40,201][__main__][INFO] - Starting iteration 71.
[2025-11-13 08:28:40,204][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1.
[2025-11-13 08:28:40,204][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:28:45,978][__main__][INFO] - Number of regex retries in iteration 71: 0 [2025-11-13 08:28:45,979][__main__][INFO] - agents played in iteration 71 are Alice_buffer, Alice, Bob, Bob_buffer [2025-11-13 08:28:46,420][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:28:46,461][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:28:46,502][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:28:46,543][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:28:46,543][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:28:46,544][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:28:47,288][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:28:47,587][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:28:47,917][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:28:48,251][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:28:48,587][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:28:48,915][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:28:49,251][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:28:49,581][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:28:49,913][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:28:50,242][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:28:50,570][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:28:50,900][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:28:51,234][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:28:51,562][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:28:51,892][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:28:52,219][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:28:52,549][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:28:52,886][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:28:53,220][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:28:53,550][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:28:53,883][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 08:28:54,213][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:28:54,541][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:28:54,869][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:28:55,199][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:28:55,527][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:28:55,856][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:28:56,185][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:28:56,513][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:28:56,843][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:28:57,175][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:28:57,503][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:28:57,831][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 08:28:58,540][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:28:59,294][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:28:59,296][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:28:59,298][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:29:00,299][__main__][INFO] - Iteration 72 took 20s (28.73% Gen, 66.28% Train). Generation: 5s, Training: 13s. Estimated remaining time: 16h 19m 36s. Estimated total time: 16h 44m 47s. Time estimates for 10 more iterations: 3m 20s, 100 more iterations: 33m 29s, 500 more iterations: 2h 47m 27s. [2025-11-13 08:29:00,301][__main__][INFO] - Starting iteration 72. [2025-11-13 08:29:00,304][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2025-11-13 08:29:00,305][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:29:05,977][__main__][INFO] - Number of regex retries in iteration 72: 0 [2025-11-13 08:29:05,978][__main__][INFO] - agents played in iteration 72 are Alice_buffer, Alice, Bob, Bob_buffer [2025-11-13 08:29:06,433][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:29:06,476][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:29:06,517][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:29:06,558][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:29:06,559][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:29:06,559][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:29:07,287][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:29:07,587][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:29:07,921][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:29:08,251][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:29:08,579][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:29:08,914][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:29:09,244][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:29:09,574][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:29:09,902][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:29:10,237][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:29:10,566][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:29:10,894][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:29:11,223][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:29:11,551][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:29:11,879][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:29:12,207][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:29:12,536][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:29:12,864][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:29:13,194][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:29:13,522][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:29:13,849][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 08:29:14,177][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:29:14,504][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:29:14,832][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:29:15,163][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:29:15,492][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:29:15,823][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:29:16,151][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:29:16,479][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:29:16,807][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:29:17,136][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:29:17,463][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:29:17,791][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 08:29:18,511][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:29:19,263][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:29:19,269][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:29:19,270][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:29:20,282][__main__][INFO] - Iteration 73 took 19s (28.39% Gen, 66.53% Train). Generation: 5s, Training: 13s. Estimated remaining time: 16h 13m 26s. Estimated total time: 16h 38m 57s. Time estimates for 10 more iterations: 3m 19s, 100 more iterations: 33m 17s, 500 more iterations: 2h 46m 29s. [2025-11-13 08:29:20,284][__main__][INFO] - Starting iteration 73. [2025-11-13 08:29:20,287][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2025-11-13 08:29:20,288][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:29:26,019][__main__][INFO] - Number of regex retries in iteration 73: 0 [2025-11-13 08:29:26,020][__main__][INFO] - agents played in iteration 73 are Alice_buffer, Alice, Bob, Bob_buffer [2025-11-13 08:29:26,462][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:29:26,503][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:29:26,544][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:29:26,585][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:29:26,586][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:29:26,586][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:29:27,317][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:29:27,617][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:29:27,947][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:29:28,278][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:29:28,612][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:29:28,941][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:29:29,269][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:29:29,597][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:29:29,925][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:29:30,252][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:29:30,582][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:29:30,910][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:29:31,239][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:29:31,566][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:29:31,893][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:29:32,220][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:29:32,555][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:29:32,887][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:29:33,216][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:29:33,544][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:29:33,872][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 08:29:34,200][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:29:34,528][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:29:34,856][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:29:35,184][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:29:35,517][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:29:35,845][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:29:36,174][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:29:36,501][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:29:36,829][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:29:37,157][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:29:37,487][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:29:37,816][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 08:29:38,550][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:29:39,288][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:29:39,289][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:29:39,291][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:29:40,373][__main__][INFO] - Iteration 74 took 20s (28.54% Gen, 66.07% Train). Generation: 5s, Training: 13s. Estimated remaining time: 16h 18m 28s. Estimated total time: 16h 44m 19s. Time estimates for 10 more iterations: 3m 20s, 100 more iterations: 33m 28s, 500 more iterations: 2h 47m 23s. [2025-11-13 08:29:40,384][__main__][INFO] - Starting iteration 74. [2025-11-13 08:29:40,388][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2025-11-13 08:29:40,388][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:29:46,082][__main__][INFO] - Number of regex retries in iteration 74: 0 [2025-11-13 08:29:46,083][__main__][INFO] - agents played in iteration 74 are Alice_buffer, Alice, Bob, Bob_buffer [2025-11-13 08:29:46,523][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:29:46,564][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:29:46,604][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:29:46,645][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:29:46,645][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:29:46,646][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:29:47,396][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:29:47,695][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:29:48,024][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:29:48,353][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:29:48,685][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:29:49,021][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:29:49,350][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:29:49,685][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:29:50,013][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:29:50,341][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:29:50,669][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:29:50,997][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:29:51,326][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:29:51,657][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:29:51,988][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:29:52,315][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:29:52,643][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:29:52,972][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:29:53,300][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:29:53,629][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:29:53,956][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 08:29:54,287][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:29:54,618][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:29:54,945][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:29:55,275][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:29:55,602][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:29:55,931][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:29:56,258][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:29:56,586][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:29:56,914][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:29:57,242][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:29:57,576][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:29:57,905][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 08:29:58,641][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:29:59,397][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:29:59,399][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:29:59,400][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:30:00,572][__main__][INFO] - Iteration 75 took 20s (28.21% Gen, 65.98% Train). Generation: 5s, Training: 13s. Estimated remaining time: 16h 23m 5s. Estimated total time: 16h 49m 17s. Time estimates for 10 more iterations: 3m 21s, 100 more iterations: 33m 38s, 500 more iterations: 2h 48m 12s. [2025-11-13 08:30:00,574][__main__][INFO] - Starting iteration 75. [2025-11-13 08:30:00,577][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2025-11-13 08:30:00,578][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:30:06,255][__main__][INFO] - Number of regex retries in iteration 75: 0 [2025-11-13 08:30:06,255][__main__][INFO] - agents played in iteration 75 are Alice_buffer, Alice, Bob, Bob_buffer [2025-11-13 08:30:06,690][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:30:06,731][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:30:06,772][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:30:06,813][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:30:06,813][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:30:06,813][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:30:07,552][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:30:07,850][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:30:08,180][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:30:08,510][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:30:08,838][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:30:09,174][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:30:09,503][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:30:09,835][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:30:10,164][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:30:10,493][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:30:10,824][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:30:11,153][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:30:11,484][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:30:11,812][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:30:12,139][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:30:12,468][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:30:12,798][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:30:13,126][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:30:13,457][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:30:13,787][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:30:14,118][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 08:30:14,452][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:30:14,787][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:30:15,118][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:30:15,449][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:30:15,778][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:30:16,113][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:30:16,441][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:30:16,769][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:30:17,097][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:30:17,425][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:30:17,753][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:30:18,082][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 08:30:18,811][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:30:19,569][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:30:19,570][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:30:19,572][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:30:20,528][__main__][INFO] - Iteration 76 took 19s (28.46% Gen, 66.75% Train). Generation: 5s, Training: 13s. Estimated remaining time: 16h 11m 3s. Estimated total time: 16h 37m 34s. Time estimates for 10 more iterations: 3m 19s, 100 more iterations: 33m 15s, 500 more iterations: 2h 46m 15s. [2025-11-13 08:30:20,530][__main__][INFO] - Starting iteration 76. [2025-11-13 08:30:20,533][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2025-11-13 08:30:20,533][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:30:26,139][__main__][INFO] - Number of regex retries in iteration 76: 0 [2025-11-13 08:30:26,140][__main__][INFO] - agents played in iteration 76 are Alice_buffer, Alice, Bob, Bob_buffer [2025-11-13 08:30:26,588][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:30:26,630][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:30:26,671][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:30:26,712][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:30:26,712][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:30:26,713][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:30:27,442][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:30:27,742][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:30:28,071][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:30:28,398][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:30:28,727][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:30:29,056][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:30:29,387][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:30:29,716][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:30:30,043][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:30:30,371][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:30:30,700][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:30:31,029][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:30:31,360][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:30:31,688][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:30:32,016][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:30:32,344][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:30:32,671][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:30:32,999][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:30:33,327][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:30:33,655][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:30:33,985][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:30:34,312][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:30:34,644][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:30:34,971][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:30:35,303][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:30:35,633][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:30:35,964][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:30:36,292][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:30:36,621][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:30:36,951][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:30:37,280][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:30:37,609][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:30:37,939][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:30:38,673][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:30:39,423][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:30:39,424][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:30:39,426][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:30:40,428][__main__][INFO] - Iteration 77 took 19s (28.18% Gen, 66.78% Train). Generation: 5s, Training: 13s. Estimated remaining time: 16h 7m 57s. Estimated total time: 16h 34m 48s. Time estimates for 10 more iterations: 3m 18s, 100 more iterations: 33m 9s, 500 more iterations: 2h 45m 48s.
[2025-11-13 08:30:40,430][__main__][INFO] - Starting iteration 77.
[2025-11-13 08:30:40,433][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1.
[2025-11-13 08:30:40,434][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:30:46,065][__main__][INFO] - Number of regex retries in iteration 77: 0
[2025-11-13 08:30:46,066][__main__][INFO] - agents played in iteration 77 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:30:46,505][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:30:46,546][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:30:46,587][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:30:46,629][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:30:46,629][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:30:46,630][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:30:47,362][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:30:47,660][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:30:47,989][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:30:48,318][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:30:48,649][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:30:48,977][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:30:49,305][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:30:49,633][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:30:49,961][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:30:50,289][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:30:50,621][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:30:50,949][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:30:51,279][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:30:51,608][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:30:51,937][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:30:52,265][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:30:52,593][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:30:52,922][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:30:53,251][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:30:53,579][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:30:53,906][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:30:54,234][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:30:54,562][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:30:54,891][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:30:55,222][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:30:55,550][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:30:55,878][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:30:56,209][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:30:56,536][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:30:56,864][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:30:57,195][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:30:57,523][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:30:57,852][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:30:58,579][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:30:59,336][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:30:59,338][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:30:59,339][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:31:00,380][__main__][INFO] - Iteration 78 took 19s (28.23% Gen, 66.54% Train). Generation: 5s, Training: 13s. Estimated remaining time: 16h 10m 12s. Estimated total time: 16h 37m 24s. Time estimates for 10 more iterations: 3m 19s, 100 more iterations: 33m 14s, 500 more iterations: 2h 46m 14s.
[2025-11-13 08:31:00,383][__main__][INFO] - Starting iteration 78.
[2025-11-13 08:31:00,386][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1.
[2025-11-13 08:31:00,386][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:31:06,150][__main__][INFO] - Number of regex retries in iteration 78: 0
[2025-11-13 08:31:06,151][__main__][INFO] - agents played in iteration 78 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:31:06,595][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:31:06,637][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:31:06,678][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:31:06,719][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:31:06,719][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:31:06,720][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:31:07,451][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:31:07,751][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:31:08,081][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:31:08,417][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:31:08,752][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:31:09,079][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:31:09,414][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:31:09,745][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:31:10,074][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:31:10,405][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:31:10,731][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:31:11,059][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:31:11,391][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:31:11,719][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:31:12,047][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:31:12,378][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:31:12,707][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:31:13,034][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:31:13,362][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:31:13,692][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:31:14,021][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:31:14,350][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:31:14,678][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:31:15,009][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:31:15,337][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:31:15,665][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:31:15,993][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:31:16,321][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:31:16,649][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:31:16,976][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:31:17,304][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:31:17,632][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:31:17,963][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:31:18,675][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:31:19,418][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:31:19,420][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:31:19,422][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:31:20,444][__main__][INFO] - Iteration 79 took 20s (28.74% Gen, 66.16% Train). Generation: 5s, Training: 13s. Estimated remaining time: 16h 15m 26s. Estimated total time: 16h 42m 57s. Time estimates for 10 more iterations: 3m 20s, 100 more iterations: 33m 25s, 500 more iterations: 2h 47m 9s.
[2025-11-13 08:31:20,447][__main__][INFO] - Starting iteration 79.
[2025-11-13 08:31:20,450][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1.
[2025-11-13 08:31:20,450][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:31:26,077][__main__][INFO] - Number of regex retries in iteration 79: 0
[2025-11-13 08:31:26,078][__main__][INFO] - agents played in iteration 79 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:31:26,521][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:31:26,562][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:31:26,604][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:31:26,645][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:31:26,646][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:31:26,647][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:31:27,382][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:31:27,682][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:31:28,016][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:31:28,345][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:31:28,679][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:31:29,010][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:31:29,339][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:31:29,668][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:31:29,996][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:31:30,324][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:31:30,653][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:31:30,982][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:31:31,309][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:31:31,637][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:31:31,965][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:31:32,293][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:31:32,622][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:31:32,951][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:31:33,278][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:31:33,605][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:31:33,933][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:31:34,261][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:31:34,592][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:31:34,920][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:31:35,249][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:31:35,577][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:31:35,906][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:31:36,235][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:31:36,563][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:31:36,891][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:31:37,219][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:31:37,546][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:31:37,877][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:31:38,588][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:31:39,326][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:31:39,327][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:31:39,329][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:31:40,325][__main__][INFO] - Iteration 80 took 19s (28.31% Gen, 66.67% Train). Generation: 5s, Training: 13s. Estimated remaining time: 16h 5m 58s. Estimated total time: 16h 33m 49s. Time estimates for 10 more iterations: 3m 18s, 100 more iterations: 33m 7s, 500 more iterations: 2h 45m 38s.
[2025-11-13 08:31:40,328][__main__][INFO] - Starting iteration 80.
[2025-11-13 08:31:40,331][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1.
[2025-11-13 08:31:40,331][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:31:45,925][__main__][INFO] - Number of regex retries in iteration 80: 0
[2025-11-13 08:31:45,926][__main__][INFO] - agents played in iteration 80 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:31:46,360][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:31:46,400][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:31:46,442][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:31:46,483][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:31:46,483][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:31:46,484][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:31:47,229][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:31:47,529][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:31:47,859][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:31:48,193][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:31:48,521][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:31:48,848][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:31:49,183][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:31:49,510][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:31:49,841][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:31:50,176][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:31:50,504][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:31:50,834][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:31:51,167][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:31:51,502][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:31:51,834][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:31:52,162][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:31:52,491][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:31:52,826][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:31:53,164][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:31:53,496][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:31:53,827][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:31:54,164][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:31:54,492][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:31:54,821][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:31:55,150][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:31:55,478][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:31:55,809][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:31:56,140][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:31:56,468][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:31:56,796][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:31:57,124][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:31:57,452][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:31:57,780][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:31:58,488][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:31:59,228][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:31:59,229][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:31:59,231][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:32:01,280][__main__][INFO] - Iteration 81 took 20s (26.70% Gen, 63.51% Train). Generation: 5s, Training: 13s. Estimated remaining time: 16h 59m 16s. Estimated total time: 17h 27m 28s. Time estimates for 10 more iterations: 3m 29s, 100 more iterations: 34m 54s, 500 more iterations: 2h 54m 34s.
[2025-11-13 08:32:01,282][__main__][INFO] - Starting iteration 81.
[2025-11-13 08:32:01,285][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1.
[2025-11-13 08:32:01,287][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:32:07,438][__main__][INFO] - Number of regex retries in iteration 81: 0
[2025-11-13 08:32:07,438][__main__][INFO] - agents played in iteration 81 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:32:07,883][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:32:07,924][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:32:07,965][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:32:08,006][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:32:08,007][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:32:08,007][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:32:08,738][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:32:09,037][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:32:09,367][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:32:09,695][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:32:10,022][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:32:10,352][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:32:10,681][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:32:11,009][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:32:11,337][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:32:11,664][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:32:11,992][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:32:12,319][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:32:12,647][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:32:12,976][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:32:13,304][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:32:13,632][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:32:13,960][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:32:14,289][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:32:14,616][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:32:14,945][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:32:15,272][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:32:15,600][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:32:15,928][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:32:16,256][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:32:16,583][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:32:16,912][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:32:17,240][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:32:17,571][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:32:17,899][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:32:18,227][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:32:18,555][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:32:18,882][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:32:19,211][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:32:19,927][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:32:20,677][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:32:20,679][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:32:20,681][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:32:21,855][__main__][INFO] - Iteration 82 took 20s (29.90% Gen, 64.37% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 39m 59s. Estimated total time: 17h 8m 31s. Time estimates for 10 more iterations: 3m 25s, 100 more iterations: 34m 17s, 500 more iterations: 2h 51m 25s.
[2025-11-13 08:32:21,857][__main__][INFO] - Starting iteration 82.
[2025-11-13 08:32:21,860][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1.
[2025-11-13 08:32:21,861][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:32:27,763][__main__][INFO] - Number of regex retries in iteration 82: 0
[2025-11-13 08:32:27,764][__main__][INFO] - agents played in iteration 82 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:32:28,200][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:32:28,241][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:32:28,282][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:32:28,323][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:32:28,323][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:32:28,323][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:32:29,050][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:32:29,349][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:32:29,679][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:32:30,007][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:32:30,336][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:32:30,666][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:32:30,995][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:32:31,325][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:32:31,652][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:32:31,980][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:32:32,309][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:32:32,636][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:32:32,965][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:32:33,295][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:32:33,624][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:32:33,952][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:32:34,282][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:32:34,611][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:32:34,939][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:32:35,267][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:32:35,595][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:32:35,924][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:32:36,253][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:32:36,581][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:32:36,911][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:32:37,242][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:32:37,571][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:32:37,905][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:32:38,233][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:32:38,561][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:32:38,889][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:32:39,219][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:32:39,555][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:32:40,265][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:32:40,989][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:32:40,991][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:32:40,993][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:32:42,118][__main__][INFO] - Iteration 83 took 20s (29.14% Gen, 65.30% Train). Generation: 5s, Training: 13s. Estimated remaining time: 16h 24m 4s. Estimated total time: 16h 52m 57s. Time estimates for 10 more iterations: 3m 22s, 100 more iterations: 33m 45s, 500 more iterations: 2h 48m 49s.
[2025-11-13 08:32:42,120][__main__][INFO] - Starting iteration 83.
[2025-11-13 08:32:42,123][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1.
[2025-11-13 08:32:42,124][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:32:47,944][__main__][INFO] - Number of regex retries in iteration 83: 0
[2025-11-13 08:32:47,945][__main__][INFO] - agents played in iteration 83 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:32:48,389][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:32:48,429][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:32:48,470][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:32:48,512][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:32:48,513][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:32:48,513][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:32:49,241][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:32:49,540][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:32:49,870][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:32:50,200][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:32:50,530][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:32:50,860][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:32:51,188][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:32:51,516][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:32:51,848][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:32:52,175][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:32:52,503][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:32:52,830][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:32:53,159][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:32:53,490][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:32:53,820][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:32:54,148][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:32:54,483][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:32:54,811][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:32:55,139][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:32:55,467][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:32:55,795][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:32:56,123][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:32:56,454][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:32:56,784][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:32:57,115][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:32:57,444][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:32:57,773][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:32:58,100][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:32:58,431][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:32:58,759][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:32:59,088][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:32:59,416][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:32:59,745][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:33:00,455][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:33:01,191][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:33:01,192][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:33:01,194][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:33:02,144][__main__][INFO] - Iteration 84 took 20s (29.08% Gen, 66.17% Train). Generation: 5s, Training: 13s. Estimated remaining time: 16h 11m 52s. Estimated total time: 16h 41m 5s. Time estimates for 10 more iterations: 3m 20s, 100 more iterations: 33m 22s, 500 more iterations: 2h 46m 50s.
[2025-11-13 08:33:02,146][__main__][INFO] - Starting iteration 84.
[2025-11-13 08:33:02,150][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1.
[2025-11-13 08:33:02,151][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:33:08,006][__main__][INFO] - Number of regex retries in iteration 84: 0
[2025-11-13 08:33:08,007][__main__][INFO] - agents played in iteration 84 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:33:08,470][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:33:08,511][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:33:08,552][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:33:08,592][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:33:08,593][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:33:08,594][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:33:09,317][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:33:09,617][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:33:09,946][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:33:10,282][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:33:10,617][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:33:10,945][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:33:11,275][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:33:11,603][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:33:11,935][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:33:12,270][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:33:12,598][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:33:12,933][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:33:13,263][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:33:13,594][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:33:13,921][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:33:14,250][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:33:14,577][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:33:14,909][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:33:15,243][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:33:15,574][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:33:15,908][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:33:16,238][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:33:16,566][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:33:16,894][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:33:17,222][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:33:17,550][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:33:17,879][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:33:18,209][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:33:18,540][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:33:18,869][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:33:19,203][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:33:19,533][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:33:19,861][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:33:20,571][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:33:21,471][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:33:21,472][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:33:21,488][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:33:22,705][__main__][INFO] - Iteration 85 took 20s (28.49% Gen, 65.58% Train). Generation: 5s, Training: 13s. Estimated remaining time: 16h 38m 14s. Estimated total time: 17h 7m 48s. Time estimates for 10 more iterations: 3m 25s, 100 more iterations: 34m 15s, 500 more iterations: 2h 51m 18s.
[2025-11-13 08:33:22,707][__main__][INFO] - Starting iteration 85.
[2025-11-13 08:33:22,710][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1.
[2025-11-13 08:33:22,710][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:33:28,553][__main__][INFO] - Number of regex retries in iteration 85: 0
[2025-11-13 08:33:28,554][__main__][INFO] - agents played in iteration 85 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:33:28,990][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:33:29,031][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:33:29,072][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:33:29,112][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:33:29,113][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:33:29,113][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:33:29,858][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:33:30,155][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:33:30,490][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:33:30,821][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:33:31,156][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:33:31,484][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:33:31,819][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:33:32,150][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:33:32,484][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:33:32,816][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:33:33,148][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:33:33,475][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:33:33,802][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:33:34,134][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:33:34,466][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:33:34,795][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:33:35,124][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:33:35,455][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:33:35,789][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:33:36,122][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:33:36,452][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:33:36,782][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:33:37,113][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:33:37,441][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:33:37,770][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:33:38,099][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:33:38,426][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:33:38,755][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:33:39,083][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:33:39,414][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:33:39,741][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:33:40,071][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:33:40,402][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:33:41,126][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:33:41,852][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:33:41,853][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:33:41,855][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:33:42,829][__main__][INFO] - Iteration 86 took 20s (29.04% Gen, 66.11% Train). Generation: 5s, Training: 13s. Estimated remaining time: 16h 16m 6s. Estimated total time: 16h 46m 0s. Time estimates for 10 more iterations: 3m 21s, 100 more iterations: 33m 32s, 500 more iterations: 2h 47m 40s.
[2025-11-13 08:33:42,831][__main__][INFO] - Starting iteration 86.
[2025-11-13 08:33:42,833][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1.
[2025-11-13 08:33:42,834][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:33:48,704][__main__][INFO] - Number of regex retries in iteration 86: 0
[2025-11-13 08:33:48,705][__main__][INFO] - agents played in iteration 86 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:33:49,148][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:33:49,190][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:33:49,230][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:33:49,271][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:33:49,271][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:33:49,272][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:33:50,009][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:33:50,306][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:33:50,636][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:33:50,967][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:33:51,298][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:33:51,628][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:33:51,959][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:33:52,287][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:33:52,616][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:33:52,945][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:33:53,274][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:33:53,605][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:33:53,934][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:33:54,263][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:33:54,594][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:33:54,925][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:33:55,253][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:33:55,581][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:33:55,910][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:33:56,237][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:33:56,566][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:33:56,894][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:33:57,222][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:33:57,549][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:33:57,877][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:33:58,205][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:33:58,533][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:33:58,861][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:33:59,189][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:33:59,517][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:33:59,846][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:34:00,174][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:34:00,505][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:34:01,251][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:34:02,006][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:34:02,007][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:34:02,009][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:34:03,044][__main__][INFO] - Iteration 87 took 20s (29.05% Gen, 65.83% Train). Generation: 5s, Training: 13s. Estimated remaining time: 16h 20m 21s. Estimated total time: 16h 50m 34s. Time estimates for 10 more iterations: 3m 22s, 100 more iterations: 33m 41s, 500 more iterations: 2h 48m 25s.
[2025-11-13 08:34:03,046][__main__][INFO] - Starting iteration 87.
[2025-11-13 08:34:03,049][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1.
[2025-11-13 08:34:03,049][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:34:08,956][__main__][INFO] - Number of regex retries in iteration 87: 0
[2025-11-13 08:34:08,957][__main__][INFO] - agents played in iteration 87 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:34:09,405][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:34:09,447][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:34:09,488][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:34:09,528][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:34:09,529][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:34:09,530][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:34:10,271][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:34:10,569][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:34:10,897][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:34:11,225][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:34:11,554][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:34:11,880][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:34:12,207][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:34:12,536][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:34:12,866][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:34:13,194][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:34:13,522][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:34:13,850][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:34:14,178][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:34:14,506][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:34:14,836][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:34:15,165][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:34:15,493][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:34:15,820][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:34:16,152][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:34:16,480][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:34:16,808][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:34:17,135][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:34:17,462][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:34:17,789][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:34:18,120][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:34:18,450][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:34:18,780][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:34:19,109][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:34:19,437][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:34:19,765][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:34:20,093][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:34:20,421][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:34:20,750][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:34:21,476][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:34:22,227][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:34:22,228][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:34:22,230][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:34:23,202][__main__][INFO] - Iteration 88 took 20s (29.31% Gen, 65.86% Train). Generation: 5s, Training: 13s. Estimated remaining time: 16h 17m 8s. Estimated total time: 16h 47m 42s. Time estimates for 10 more iterations: 3m 21s, 100 more iterations: 33m 35s, 500 more iterations: 2h 47m 57s.
[2025-11-13 08:34:23,203][__main__][INFO] - Starting iteration 88.
[2025-11-13 08:34:23,206][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1.
[2025-11-13 08:34:23,207][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:34:29,100][__main__][INFO] - Number of regex retries in iteration 88: 0
[2025-11-13 08:34:29,101][__main__][INFO] - agents played in iteration 88 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:34:29,540][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:34:29,583][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:34:29,623][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:34:29,664][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:34:29,664][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:34:29,665][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:34:30,397][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:34:30,696][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:34:31,025][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:34:31,355][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:34:31,685][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:34:32,012][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:34:32,340][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:34:32,670][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:34:33,001][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:34:33,331][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:34:33,660][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:34:33,989][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:34:34,321][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:34:34,653][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:34:34,983][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:34:35,311][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:34:35,645][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:34:35,973][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:34:36,303][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:34:36,631][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:34:36,958][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:34:37,286][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:34:37,614][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:34:37,943][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:34:38,272][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:34:38,600][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:34:38,928][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:34:39,256][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:34:39,584][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:34:39,912][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:34:40,241][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:34:40,568][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:34:40,897][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:34:41,609][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:34:42,342][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:34:42,344][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:34:42,345][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:34:43,327][__main__][INFO] - Iteration 89 took 20s (29.29% Gen, 65.82% Train). Generation: 5s, Training: 13s. Estimated remaining time: 16h 15m 10s. Estimated total time: 16h 46m 4s. Time estimates for 10 more iterations: 3m 21s, 100 more iterations: 33m 32s, 500 more iterations: 2h 47m 40s.
[2025-11-13 08:34:43,329][__main__][INFO] - Starting iteration 89.
[2025-11-13 08:34:43,332][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1.
[2025-11-13 08:34:43,333][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:34:49,238][__main__][INFO] - Number of regex retries in iteration 89: 0
[2025-11-13 08:34:49,238][__main__][INFO] - agents played in iteration 89 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:34:49,678][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:34:49,720][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:34:49,760][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:34:49,802][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:34:49,803][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:34:49,803][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:34:50,534][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:34:50,835][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:34:51,163][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:34:51,494][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:34:51,822][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:34:52,152][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:34:52,482][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:34:52,812][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:34:53,143][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:34:53,473][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:34:53,800][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:34:54,129][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:34:54,456][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:34:54,784][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:34:55,112][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:34:55,441][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:34:55,773][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:34:56,111][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:34:56,446][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:34:56,777][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:34:57,107][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:34:57,438][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:34:57,765][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:34:58,096][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:34:58,423][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:34:58,751][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:34:59,081][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:34:59,410][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:34:59,740][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:35:00,068][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:35:00,397][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:35:00,724][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:35:01,056][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:35:01,768][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:35:02,522][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:35:02,523][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:35:02,525][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:35:03,681][__main__][INFO] - Iteration 90 took 20s (29.02% Gen, 65.29% Train). Generation: 5s, Training: 13s. Estimated remaining time: 16h 26m 14s. Estimated total time: 16h 57m 29s. Time estimates for 10 more iterations: 3m 23s, 100 more iterations: 33m 54s, 500 more iterations: 2h 49m 34s.
[2025-11-13 08:35:03,687][__main__][INFO] - Starting iteration 90.
[2025-11-13 08:35:03,690][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1.
[2025-11-13 08:35:03,691][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:35:09,563][__main__][INFO] - Number of regex retries in iteration 90: 0
[2025-11-13 08:35:09,564][__main__][INFO] - agents played in iteration 90 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:35:09,999][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:35:10,041][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:35:10,082][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:35:10,123][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:35:10,123][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:35:10,124][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:35:10,855][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:35:11,157][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:35:11,486][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:35:11,817][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:35:12,149][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:35:12,481][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:35:12,809][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:35:13,137][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:35:13,467][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:35:13,794][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:35:14,123][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:35:14,451][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:35:14,780][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:35:15,111][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:35:15,440][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:35:15,768][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:35:16,097][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:35:16,428][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:35:16,757][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:35:17,085][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:35:17,414][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:35:17,742][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:35:18,070][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:35:18,398][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:35:18,727][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:35:19,056][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:35:19,387][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:35:19,715][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:35:20,044][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:35:20,373][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:35:20,708][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:35:21,044][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:35:21,373][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:35:22,087][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:35:22,835][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:35:22,837][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:35:22,838][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:35:24,924][__main__][INFO] - Iteration 91 took 21s (27.66% Gen, 62.51% Train). Generation: 5s, Training: 13s. Estimated remaining time: 17h 10m 10s. Estimated total time: 17h 41m 46s. Time estimates for 10 more iterations: 3m 32s, 100 more iterations: 35m 23s, 500 more iterations: 2h 56m 57s.
[2025-11-13 08:35:24,927][__main__][INFO] - Starting iteration 91.
[2025-11-13 08:35:24,929][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1.
[2025-11-13 08:35:24,930][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:35:31,469][__main__][INFO] - Number of regex retries in iteration 91: 0
[2025-11-13 08:35:31,470][__main__][INFO] - agents played in iteration 91 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:35:31,912][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:35:31,954][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:35:31,993][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:35:32,033][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:35:32,034][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:35:32,034][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:35:32,780][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:35:33,080][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:35:33,409][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:35:33,738][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:35:34,066][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:35:34,395][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:35:34,725][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:35:35,055][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:35:35,383][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:35:35,712][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:35:36,047][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:35:36,384][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:35:36,711][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:35:37,040][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:35:37,367][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:35:37,695][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:35:38,023][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:35:38,351][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:35:38,679][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:35:39,007][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:35:39,334][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:35:39,666][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:35:39,996][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:35:40,322][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:35:40,650][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:35:40,984][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:35:41,316][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:35:41,647][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:35:41,975][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:35:42,304][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:35:42,632][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:35:42,960][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:35:43,289][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:35:44,000][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:35:44,745][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:35:44,746][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:35:44,748][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:35:45,724][__main__][INFO] - Iteration 92 took 20s (31.45% Gen, 63.85% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 47m 50s. Estimated total time: 17h 19m 47s. Time estimates for 10 more iterations: 3m 27s, 100 more iterations: 34m 39s, 500 more iterations: 2h 53m 17s.
[2025-11-13 08:35:45,726][__main__][INFO] - Starting iteration 92.
[2025-11-13 08:35:45,729][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1.
[2025-11-13 08:35:45,729][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:35:51,899][__main__][INFO] - Number of regex retries in iteration 92: 0
[2025-11-13 08:35:51,900][__main__][INFO] - agents played in iteration 92 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:35:52,336][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:35:52,377][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:35:52,417][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:35:52,458][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:35:52,458][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:35:52,459][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:35:53,198][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:35:53,497][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:35:53,826][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:35:54,154][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:35:54,482][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:35:54,809][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:35:55,137][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:35:55,465][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:35:55,793][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:35:56,122][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:35:56,453][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:35:56,781][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:35:57,108][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:35:57,439][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:35:57,768][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:35:58,103][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:35:58,431][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:35:58,759][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:35:59,092][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:35:59,422][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:35:59,752][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:36:00,082][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:36:00,410][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:36:00,737][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:36:01,066][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:36:01,397][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:36:01,725][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:36:02,057][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:36:02,387][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:36:02,718][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:36:03,052][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:36:03,379][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:36:03,707][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:36:04,422][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:36:05,179][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:36:05,180][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:36:05,182][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:36:06,243][__main__][INFO] - Iteration 93 took 20s (30.08% Gen, 64.75% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 33m 29s. Estimated total time: 17h 5m 46s. Time estimates for 10 more iterations: 3m 25s, 100 more iterations: 34m 11s, 500 more iterations: 2h 50m 57s.
[2025-11-13 08:36:06,245][__main__][INFO] - Starting iteration 93.
[2025-11-13 08:36:06,247][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1.
[2025-11-13 08:36:06,248][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:36:12,378][__main__][INFO] - Number of regex retries in iteration 93: 0 [2025-11-13 08:36:12,379][__main__][INFO] - agents played in iteration 93 are Alice_buffer, Alice, Bob, Bob_buffer [2025-11-13 08:36:12,814][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:36:12,856][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:36:12,896][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:36:12,937][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:36:12,938][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:36:12,938][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:36:13,668][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:36:13,967][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:36:14,297][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:36:14,624][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:36:14,952][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:36:15,278][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:36:15,608][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:36:15,935][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:36:16,264][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:36:16,592][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:36:16,923][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:36:17,253][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:36:17,582][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:36:17,910][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:36:18,237][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:36:18,565][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:36:18,893][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:36:19,223][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:36:19,551][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:36:19,880][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:36:20,207][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 08:36:20,535][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:36:20,863][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:36:21,191][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:36:21,519][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:36:21,846][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:36:22,173][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:36:22,501][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:36:22,829][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:36:23,156][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:36:23,484][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:36:23,811][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:36:24,139][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
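The "Processing mini-batch k of 128" records above trace one gradient-accumulation pass: the policy-gradient loss is accumulated across all 128 mini-batches (progress logged every 4), then a single "Apply reinforce step" update follows. A minimal framework-free sketch of that loop shape — `grad_fn`, `step_fn`, and the scalar "gradient" are illustrative stand-ins, not the trainer's actual API:

```python
# Schematic gradient accumulation: sum per-mini-batch gradients,
# apply exactly one optimizer step at the end (names are hypothetical).

def accumulate_and_step(minibatches, grad_fn, step_fn, log_every=4):
    """Accumulate gradients over all mini-batches, then take one step."""
    accum_grad = 0.0
    total_tokens = 0
    n = len(minibatches)
    for i, batch in enumerate(minibatches):
        if i % log_every == 0:
            print(f"Processing mini-batch {i} of {n}")
        grad, tokens = grad_fn(batch)   # backward pass for this mini-batch
        accum_grad += grad              # gradients add up across mini-batches
        total_tokens += tokens
    print(f"Accumulated the policy gradient loss for {total_tokens} tokens.")
    step_fn(accum_grad / n)             # single averaged optimizer update
    return total_tokens
```

With 128 mini-batches averaging 30 trained tokens each, this would report the 3840-token total seen in the log.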
[2025-11-13 08:36:24,855][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:36:25,596][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:36:25,597][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:36:25,599][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:36:26,593][__main__][INFO] - Iteration 94 took 20s (30.13% Gen, 64.97% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 24m 42s. Estimated total time: 16h 57m 19s. Time estimates for 10 more iterations: 3m 23s, 100 more iterations: 33m 54s, 500 more iterations: 2h 49m 33s.
[2025-11-13 08:36:26,595][__main__][INFO] - Starting iteration 94.
[2025-11-13 08:36:26,599][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1.
[2025-11-13 08:36:26,599][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:36:32,815][__main__][INFO] - Number of regex retries in iteration 94: 0
[2025-11-13 08:36:32,816][__main__][INFO] - agents played in iteration 94 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:36:33,255][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:36:33,296][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:36:33,338][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:36:33,379][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:36:33,379][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:36:33,380][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:36:34,102][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:36:34,400][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:36:34,732][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:36:35,060][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:36:35,388][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:36:35,716][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:36:36,044][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:36:36,376][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:36:36,707][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:36:37,034][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:36:37,362][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:36:37,693][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:36:38,021][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:36:38,348][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:36:38,677][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:36:39,005][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:36:39,333][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:36:39,661][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:36:39,988][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:36:40,316][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:36:40,645][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:36:40,974][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:36:41,302][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:36:41,630][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:36:41,958][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:36:42,286][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:36:42,616][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:36:42,945][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:36:43,274][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:36:43,603][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:36:43,931][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:36:44,262][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:36:44,589][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
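The recurring "For task: …, ΔVRAM % (total): …, Current % of VRAM taken: …, Block Peak % of device VRAM: …, ΔTime: …" records suggest a context-manager-style profiler wrapped around each training block. A hedged sketch of what produces such a line — the real trainer presumably reads torch.cuda memory statistics, so here the memory probes are injected callables to keep the sketch framework-free:

```python
import time
from contextlib import contextmanager

@contextmanager
def track_block(task, mem_used_frac, mem_peak_frac, log=print):
    """Log VRAM delta, current usage, block peak, and elapsed time for a task.

    mem_used_frac / mem_peak_frac are hypothetical probes returning fractions
    of device memory in [0, 1] (stand-ins for e.g. memory_allocated and
    max_memory_allocated divided by total device memory).
    """
    start_mem = mem_used_frac()
    start = time.monotonic()
    try:
        yield
    finally:
        elapsed = int(time.monotonic() - start)
        h, rem = divmod(elapsed, 3600)
        m, s = divmod(rem, 60)
        log(
            f"For task: {task}, "
            f"ΔVRAM % (total): {100 * (mem_used_frac() - start_mem):.2f}%, "
            f"Current % of VRAM taken: {100 * mem_used_frac():.2f}%, "
            f"Block Peak % of device VRAM: {100 * mem_peak_frac():.2f}%, "
            f"ΔTime: {h:02d}:{m:02d}:{s:02d}"
        )
```

With constant probes at 0.4078 and 0.1952 this reproduces the "ΔVRAM 0.00%, 40.78% taken, 19.52% peak" shape of the advantage-computation records above.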
[2025-11-13 08:36:45,300][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:36:46,042][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:36:46,044][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:36:46,046][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:36:47,129][__main__][INFO] - Iteration 95 took 20s (30.28% Gen, 64.44% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 33m 35s. Estimated total time: 17h 6m 33s. Time estimates for 10 more iterations: 3m 25s, 100 more iterations: 34m 13s, 500 more iterations: 2h 51m 5s.
[2025-11-13 08:36:47,131][__main__][INFO] - Starting iteration 95.
[2025-11-13 08:36:47,135][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1.
[2025-11-13 08:36:47,135][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:36:53,268][__main__][INFO] - Number of regex retries in iteration 95: 0
[2025-11-13 08:36:53,269][__main__][INFO] - agents played in iteration 95 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:36:53,707][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:36:53,749][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:36:53,792][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:36:53,833][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:36:53,834][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:36:53,834][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:36:54,560][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:36:54,858][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:36:55,186][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:36:55,515][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:36:55,843][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:36:56,178][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:36:56,509][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:36:56,840][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:36:57,169][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:36:57,499][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:36:57,829][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:36:58,161][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:36:58,497][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:36:58,824][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:36:59,155][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:36:59,490][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:36:59,818][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:37:00,148][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:37:00,476][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:37:00,811][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:37:01,146][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:37:01,478][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:37:01,813][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:37:02,145][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:37:02,479][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:37:02,810][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:37:03,139][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:37:03,468][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:37:03,795][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:37:04,130][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:37:04,461][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:37:04,794][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:37:05,126][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:37:05,855][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:37:06,605][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:37:06,606][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:37:06,608][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:37:07,637][__main__][INFO] - Iteration 96 took 20s (29.91% Gen, 65.05% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 31m 53s. Estimated total time: 17h 5m 11s. Time estimates for 10 more iterations: 3m 25s, 100 more iterations: 34m 10s, 500 more iterations: 2h 50m 51s.
[2025-11-13 08:37:07,640][__main__][INFO] - Starting iteration 96.
[2025-11-13 08:37:07,643][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1.
[2025-11-13 08:37:07,644][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:37:13,855][__main__][INFO] - Number of regex retries in iteration 96: 0
[2025-11-13 08:37:13,856][__main__][INFO] - agents played in iteration 96 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:37:14,291][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:37:14,332][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:37:14,373][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:37:14,414][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:37:14,415][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:37:14,415][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:37:15,148][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:37:15,448][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:37:15,777][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:37:16,105][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:37:16,433][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:37:16,762][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:37:17,089][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:37:17,422][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:37:17,757][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:37:18,085][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:37:18,413][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:37:18,742][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:37:19,070][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:37:19,398][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:37:19,727][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:37:20,056][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:37:20,384][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:37:20,713][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:37:21,041][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:37:21,370][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:37:21,705][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:37:22,036][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:37:22,365][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:37:22,693][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:37:23,022][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:37:23,350][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:37:23,678][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:37:24,006][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:37:24,336][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:37:24,662][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:37:24,991][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:37:25,320][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:37:25,651][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:37:26,362][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:37:27,103][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:37:27,105][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:37:27,107][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:37:28,066][__main__][INFO] - Iteration 97 took 20s (30.41% Gen, 64.88% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 27m 33s. Estimated total time: 17h 1m 12s. Time estimates for 10 more iterations: 3m 24s, 100 more iterations: 34m 2s, 500 more iterations: 2h 50m 12s.
[2025-11-13 08:37:28,069][__main__][INFO] - Starting iteration 97.
[2025-11-13 08:37:28,073][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1.
[2025-11-13 08:37:28,073][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:37:34,333][__main__][INFO] - Number of regex retries in iteration 97: 0
[2025-11-13 08:37:34,333][__main__][INFO] - agents played in iteration 97 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:37:34,771][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:37:34,812][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:37:34,853][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:37:34,894][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:37:34,895][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:37:34,895][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:37:35,623][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:37:35,922][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:37:36,250][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:37:36,578][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:37:36,906][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:37:37,234][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:37:37,566][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:37:37,900][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:37:38,227][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:37:38,558][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:37:38,886][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:37:39,216][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:37:39,551][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:37:39,880][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:37:40,207][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:37:40,536][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:37:40,864][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:37:41,192][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:37:41,522][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:37:41,850][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:37:42,179][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:37:42,507][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:37:42,836][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:37:43,165][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:37:43,492][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:37:43,820][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:37:44,148][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:37:44,479][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:37:44,808][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:37:45,136][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:37:45,467][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:37:45,795][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:37:46,123][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:37:46,836][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:37:47,580][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:37:47,582][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:37:47,583][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:37:48,658][__main__][INFO] - Iteration 98 took 20s (30.41% Gen, 64.36% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 35m 19s. Estimated total time: 17h 9m 19s. Time estimates for 10 more iterations: 3m 25s, 100 more iterations: 34m 18s, 500 more iterations: 2h 51m 33s.
[2025-11-13 08:37:48,660][__main__][INFO] - Starting iteration 98.
[2025-11-13 08:37:48,663][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1.
[2025-11-13 08:37:48,664][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:37:55,003][__main__][INFO] - Number of regex retries in iteration 98: 0
[2025-11-13 08:37:55,003][__main__][INFO] - agents played in iteration 98 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:37:55,439][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:37:55,480][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:37:55,520][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:37:55,562][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:37:55,563][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:37:55,563][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:37:56,304][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:37:56,603][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:37:56,937][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:37:57,267][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:37:57,595][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:37:57,923][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:37:58,251][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:37:58,579][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:37:58,907][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:37:59,235][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:37:59,564][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:37:59,892][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:38:00,219][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:38:00,547][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:38:00,874][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:38:01,203][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:38:01,532][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:38:01,860][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:38:02,188][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:38:02,519][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:38:02,847][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:38:03,174][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:38:03,504][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:38:03,832][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:38:04,160][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:38:04,490][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:38:04,819][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:38:05,148][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:38:05,476][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:38:05,804][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:38:06,132][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:38:06,463][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:38:06,791][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:38:07,503][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:38:08,243][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:38:08,245][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:38:08,247][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:38:09,500][__main__][INFO] - Iteration 99 took 20s (30.42% Gen, 63.55% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 47m 33s. Estimated total time: 17h 21m 53s. Time estimates for 10 more iterations: 3m 28s, 100 more iterations: 34m 43s, 500 more iterations: 2h 53m 38s.
[2025-11-13 08:38:09,502][__main__][INFO] - Starting iteration 99.
[2025-11-13 08:38:09,505][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1.
[2025-11-13 08:38:09,505][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:38:15,637][__main__][INFO] - Number of regex retries in iteration 99: 0
[2025-11-13 08:38:15,638][__main__][INFO] - agents played in iteration 99 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:38:16,071][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:38:16,112][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:38:16,152][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:38:16,193][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:38:16,193][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:38:16,193][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:38:16,924][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:38:17,222][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:38:17,551][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:38:17,879][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:38:18,207][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:38:18,535][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:38:18,863][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:38:19,196][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:38:19,527][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:38:19,858][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:38:20,186][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:38:20,516][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:38:20,847][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:38:21,178][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:38:21,507][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:38:21,841][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:38:22,172][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:38:22,507][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:38:22,839][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:38:23,173][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:38:23,504][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:38:23,835][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:38:24,169][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:38:24,502][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:38:24,838][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:38:25,169][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:38:25,500][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:38:25,829][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:38:26,159][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:38:26,487][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:38:26,816][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:38:27,145][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:38:27,475][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:38:28,173][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:38:28,913][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:38:28,914][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:38:28,916][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:38:29,929][__main__][INFO] - Iteration 100 took 20s (30.02% Gen, 65.01% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 26m 34s. Estimated total time: 17h 1m 15s. Time estimates for 10 more iterations: 3m 24s, 100 more iterations: 34m 2s, 500 more iterations: 2h 50m 12s.
[2025-11-13 08:38:29,931][__main__][INFO] - Starting iteration 100.
[2025-11-13 08:38:29,934][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1.
[2025-11-13 08:38:29,934][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:38:36,066][__main__][INFO] - Number of regex retries in iteration 100: 0
[2025-11-13 08:38:36,067][__main__][INFO] - agents played in iteration 100 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:38:36,511][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:38:36,552][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:38:36,592][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:38:36,633][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:38:36,633][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:38:36,634][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:38:37,363][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:38:37,660][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:38:37,989][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:38:38,319][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:38:38,646][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:38:38,977][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:38:39,304][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:38:39,633][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:38:39,967][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:38:40,295][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:38:40,623][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:38:40,949][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:38:41,278][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:38:41,605][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:38:41,933][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:38:42,261][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:38:42,589][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:38:42,917][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:38:43,245][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:38:43,575][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:38:43,905][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:38:44,235][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:38:44,565][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:38:44,894][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:38:45,222][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:38:45,549][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:38:45,877][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:38:46,205][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:38:46,534][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:38:46,861][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:38:47,190][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:38:47,518][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:38:47,848][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:38:48,559][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:38:49,292][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:38:49,294][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:38:49,296][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:38:51,317][__main__][INFO] - Iteration 101 took 21s (28.68% Gen, 61.86% Train). Generation: 6s, Training: 13s. Estimated remaining time: 17h 14m 10s. Estimated total time: 17h 49m 12s. Time estimates for 10 more iterations: 3m 33s, 100 more iterations: 35m 38s, 500 more iterations: 2h 58m 12s.
[2025-11-13 08:38:51,319][__main__][INFO] - Starting iteration 101.
[2025-11-13 08:38:51,321][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1.
[2025-11-13 08:38:51,322][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:38:57,865][__main__][INFO] - Number of regex retries in iteration 101: 0
[2025-11-13 08:38:57,866][__main__][INFO] - agents played in iteration 101 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:38:58,307][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:38:58,348][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:38:58,389][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:38:58,430][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:38:58,430][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:38:58,431][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:38:59,165][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:38:59,462][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:38:59,790][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:39:00,121][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:39:00,449][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:39:00,777][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:39:01,105][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:39:01,432][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:39:01,761][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:39:02,089][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:39:02,418][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:39:02,746][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:39:03,074][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:39:03,402][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:39:03,731][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:39:04,060][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:39:04,389][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:39:04,717][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:39:05,045][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:39:05,374][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:39:05,704][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:39:06,035][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:39:06,367][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:39:06,698][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:39:07,029][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:39:07,357][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:39:07,688][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:39:08,016][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:39:08,344][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:39:08,675][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:39:09,006][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:39:09,335][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:39:09,666][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:39:10,390][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:39:11,147][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:39:11,149][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:39:11,150][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:39:12,155][__main__][INFO] - Iteration 102 took 20s (31.41% Gen, 63.76% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 46m 20s. Estimated total time: 17h 21m 43s. Time estimates for 10 more iterations: 3m 28s, 100 more iterations: 34m 43s, 500 more iterations: 2h 53m 37s.
[2025-11-13 08:39:12,158][__main__][INFO] - Starting iteration 102.
[2025-11-13 08:39:12,161][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1.
[2025-11-13 08:39:12,162][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:39:18,700][__main__][INFO] - Number of regex retries in iteration 102: 0
[2025-11-13 08:39:18,700][__main__][INFO] - agents played in iteration 102 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:39:19,142][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:39:19,184][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:39:19,224][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:39:19,265][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:39:19,266][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:39:19,266][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:39:19,996][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:39:20,293][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:39:20,621][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:39:20,948][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:39:21,276][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:39:21,605][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:39:21,935][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:39:22,264][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:39:22,592][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:39:22,920][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:39:23,248][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:39:23,576][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:39:23,905][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:39:24,231][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:39:24,562][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:39:24,890][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:39:25,218][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:39:25,553][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:39:25,887][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:39:26,217][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:39:26,547][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:39:26,877][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:39:27,208][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:39:27,536][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:39:27,863][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:39:28,192][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:39:28,520][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:39:28,848][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:39:29,176][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:39:29,505][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:39:29,835][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:39:30,164][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:39:30,496][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:39:31,210][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:39:31,970][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:39:31,971][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:39:31,973][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:39:33,023][__main__][INFO] - Iteration 103 took 20s (31.34% Gen, 63.62% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 47m 24s. Estimated total time: 17h 23m 7s. Time estimates for 10 more iterations: 3m 28s, 100 more iterations: 34m 46s, 500 more iterations: 2h 53m 51s.
[2025-11-13 08:39:33,025][__main__][INFO] - Starting iteration 103.
[2025-11-13 08:39:33,028][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1.
[2025-11-13 08:39:33,028][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:39:39,453][__main__][INFO] - Number of regex retries in iteration 103: 0
[2025-11-13 08:39:39,453][__main__][INFO] - agents played in iteration 103 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:39:39,887][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:39:39,928][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:39:39,969][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:39:40,010][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:39:40,010][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:39:40,011][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:39:40,735][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:39:41,037][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:39:41,366][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:39:41,697][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:39:42,024][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:39:42,352][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:39:42,683][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:39:43,011][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:39:43,342][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:39:43,671][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:39:44,002][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:39:44,329][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:39:44,657][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:39:44,985][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:39:45,313][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:39:45,643][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:39:45,977][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:39:46,306][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:39:46,633][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:39:46,961][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:39:47,288][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:39:47,616][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:39:47,945][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:39:48,272][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:39:48,600][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:39:48,928][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:39:49,256][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:39:49,586][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:39:49,913][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:39:50,241][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:39:50,577][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:39:50,906][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:39:51,235][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:39:51,945][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:39:52,688][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:39:52,689][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:39:52,691][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:39:53,744][__main__][INFO] - Iteration 104 took 20s (31.01% Gen, 63.90% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 39m 48s. Estimated total time: 17h 15m 53s. Time estimates for 10 more iterations: 3m 27s, 100 more iterations: 34m 31s, 500 more iterations: 2h 52m 38s.
[2025-11-13 08:39:53,747][__main__][INFO] - Starting iteration 104.
[2025-11-13 08:39:53,751][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1.
[2025-11-13 08:39:53,752][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:39:58,060][mllm.models.large_language_model_local][WARNING] - Response %A> did not match regex: (|), retry 1/1
[2025-11-13 08:40:00,778][__main__][INFO] - Number of regex retries in iteration 104: 1
[2025-11-13 08:40:00,778][__main__][INFO] - agents played in iteration 104 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:40:01,233][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:40:01,270][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:40:01,304][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:40:01,337][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:40:01,338][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:40:01,338][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:40:02,038][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:40:02,333][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:40:02,659][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:40:02,985][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:40:03,311][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:40:03,637][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:40:03,968][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:40:04,294][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:40:04,623][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:40:04,948][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:40:05,276][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:40:05,605][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:40:05,930][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:40:06,258][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:40:06,586][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:40:06,917][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:40:07,244][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:40:07,570][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:40:07,896][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:40:08,223][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:40:08,548][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:40:08,873][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:40:09,199][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:40:09,526][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:40:09,852][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:40:10,178][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:40:10,508][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:40:10,834][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:40:11,160][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:40:11,487][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:40:11,813][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:40:12,138][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:40:12,464][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:40:13,187][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:40:13,922][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:40:13,924][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:40:13,926][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:40:14,885][__main__][INFO] - Iteration 105 took 21s (33.25% Gen, 62.21% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 0m 20s. Estimated total time: 17h 36m 45s. Time estimates for 10 more iterations: 3m 31s, 100 more iterations: 35m 13s, 500 more iterations: 2h 56m 7s.
[2025-11-13 08:40:14,887][__main__][INFO] - Starting iteration 105.
[2025-11-13 08:40:14,890][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1.
[2025-11-13 08:40:14,890][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:40:21,421][__main__][INFO] - Number of regex retries in iteration 105: 0
[2025-11-13 08:40:21,422][__main__][INFO] - agents played in iteration 105 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:40:21,887][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:40:21,922][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:40:21,956][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:40:21,990][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:40:21,990][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:40:21,991][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:40:22,693][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:40:22,988][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:40:23,313][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:40:23,639][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:40:23,965][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:40:24,291][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:40:24,622][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:40:24,953][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:40:25,281][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:40:25,607][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:40:25,937][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:40:26,265][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:40:26,593][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:40:26,923][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:40:27,248][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:40:27,574][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:40:27,903][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:40:28,230][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:40:28,556][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:40:28,886][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:40:29,214][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:40:29,541][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:40:29,865][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:40:30,191][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:40:30,520][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:40:30,848][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:40:31,173][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:40:31,498][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:40:31,824][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:40:32,149][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:40:32,475][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:40:32,806][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:40:33,132][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:40:33,846][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:40:34,583][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:40:34,585][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:40:34,586][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:40:35,565][__main__][INFO] - Iteration 106 took 20s (31.59% Gen, 63.67% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 37m 2s. Estimated total time: 17h 13m 48s. Time estimates for 10 more iterations: 3m 26s, 100 more iterations: 34m 27s, 500 more iterations: 2h 52m 18s.
[2025-11-13 08:40:35,567][__main__][INFO] - Starting iteration 106.
[2025-11-13 08:40:35,570][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1.
[2025-11-13 08:40:35,570][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:40:42,213][__main__][INFO] - Number of regex retries in iteration 106: 0
[2025-11-13 08:40:42,214][__main__][INFO] - agents played in iteration 106 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:40:42,652][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:40:42,688][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:40:42,723][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:40:42,757][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:40:42,757][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:40:42,758][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:40:43,460][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:40:43,756][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:40:44,082][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:40:44,410][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:40:44,739][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:40:45,065][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:40:45,391][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:40:45,717][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:40:46,043][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:40:46,367][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:40:46,692][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:40:47,018][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:40:47,350][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:40:47,675][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:40:48,001][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:40:48,327][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:40:48,652][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:40:48,978][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:40:49,304][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:40:49,629][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:40:49,955][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:40:50,283][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:40:50,610][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:40:50,935][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:40:51,263][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:40:51,589][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:40:51,914][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:40:52,240][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:40:52,568][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:40:52,895][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:40:53,226][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:40:53,557][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:40:53,888][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:40:54,600][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:40:55,334][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:40:55,335][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:40:55,337][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:40:56,273][__main__][INFO] - Iteration 107 took 20s (32.09% Gen, 63.38% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 38m 4s. Estimated total time: 17h 15m 11s. Time estimates for 10 more iterations: 3m 27s, 100 more iterations: 34m 30s, 500 more iterations: 2h 52m 31s.
[2025-11-13 08:40:56,275][__main__][INFO] - Starting iteration 107.
[2025-11-13 08:40:56,277][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1.
[2025-11-13 08:40:56,278][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:41:02,837][__main__][INFO] - Number of regex retries in iteration 107: 0
[2025-11-13 08:41:02,838][__main__][INFO] - agents played in iteration 107 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:41:03,282][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:41:03,315][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:41:03,347][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:41:03,381][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:41:03,382][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:41:03,382][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:41:04,097][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:41:04,393][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:41:04,719][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:41:05,050][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:41:05,377][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:41:05,703][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:41:06,029][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:41:06,355][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:41:06,681][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:41:07,005][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:41:07,330][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:41:07,655][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:41:07,981][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:41:08,306][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:41:08,633][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:41:08,959][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:41:09,284][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:41:09,614][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:41:09,940][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:41:10,268][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:41:10,594][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:41:10,925][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:41:11,254][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:41:11,580][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:41:11,905][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:41:12,231][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:41:12,556][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:41:12,885][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:41:13,209][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:41:13,534][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:41:13,861][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:41:14,186][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:41:14,514][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:41:15,219][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:41:15,962][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:41:15,963][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:41:15,965][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:41:16,982][__main__][INFO] - Iteration 108 took 20s (31.68% Gen, 63.40% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 37m 50s. Estimated total time: 17h 15m 18s. Time estimates for 10 more iterations: 3m 27s, 100 more iterations: 34m 30s, 500 more iterations: 2h 52m 33s.
[2025-11-13 08:41:16,984][__main__][INFO] - Starting iteration 108.
[2025-11-13 08:41:16,987][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1.
[2025-11-13 08:41:16,987][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:41:23,601][__main__][INFO] - Number of regex retries in iteration 108: 0
[2025-11-13 08:41:23,601][__main__][INFO] - agents played in iteration 108 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:41:24,040][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:41:24,074][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:41:24,107][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:41:24,141][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:41:24,142][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:41:24,142][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:41:24,871][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:41:25,170][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:41:25,499][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:41:25,827][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:41:26,153][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:41:26,478][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:41:26,804][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:41:27,129][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:41:27,455][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:41:27,782][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:41:28,109][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:41:28,435][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:41:28,762][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:41:29,088][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:41:29,413][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:41:29,738][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:41:30,063][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:41:30,393][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:41:30,722][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:41:31,046][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:41:31,371][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:41:31,697][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:41:32,022][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:41:32,348][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:41:32,673][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:41:33,000][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:41:33,325][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:41:33,651][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:41:33,976][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:41:34,302][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:41:34,628][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:41:34,954][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:41:35,280][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:41:36,000][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:41:36,755][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:41:36,757][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:41:36,758][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:41:37,747][__main__][INFO] - Iteration 109 took 20s (31.86% Gen, 63.38% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 40m 13s. Estimated total time: 17h 18m 1s. Time estimates for 10 more iterations: 3m 27s, 100 more iterations: 34m 36s, 500 more iterations: 2h 53m 0s.
[2025-11-13 08:41:37,749][__main__][INFO] - Starting iteration 109.
[2025-11-13 08:41:37,752][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1.
[2025-11-13 08:41:37,752][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:41:44,453][__main__][INFO] - Number of regex retries in iteration 109: 0
[2025-11-13 08:41:44,453][__main__][INFO] - agents played in iteration 109 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:41:44,905][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:41:44,949][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:41:44,983][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:41:45,017][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:41:45,018][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:41:45,018][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:41:45,758][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:41:46,055][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:41:46,383][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:41:46,708][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:41:47,038][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:41:47,366][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:41:47,693][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:41:48,021][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:41:48,349][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:41:48,679][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:41:49,008][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:41:49,336][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:41:49,664][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:41:49,989][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:41:50,315][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:41:50,640][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:41:50,966][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:41:51,291][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:41:51,618][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:41:51,944][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:41:52,269][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:41:52,597][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:41:52,922][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:41:53,248][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:41:53,573][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:41:53,898][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:41:54,223][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:41:54,550][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:41:54,875][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:41:55,201][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:41:55,528][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:41:55,855][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:41:56,180][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:41:56,895][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:41:57,640][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:41:57,642][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:41:57,643][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:41:58,615][__main__][INFO] - Iteration 110 took 20s (32.12% Gen, 63.22% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 45m 2s. Estimated total time: 17h 23m 11s. Time estimates for 10 more iterations: 3m 28s, 100 more iterations: 34m 46s, 500 more iterations: 2h 53m 51s.
[2025-11-13 08:41:58,617][__main__][INFO] - Starting iteration 110.
[2025-11-13 08:41:58,620][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1.
[2025-11-13 08:41:58,620][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:42:05,328][__main__][INFO] - Number of regex retries in iteration 110: 0
[2025-11-13 08:42:05,328][__main__][INFO] - agents played in iteration 110 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:42:05,779][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:42:05,813][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:42:05,846][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:42:05,880][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:42:05,880][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:42:05,880][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:42:06,603][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:42:06,899][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:42:07,225][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:42:07,552][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:42:07,878][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:42:08,204][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:42:08,530][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:42:08,856][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:42:09,183][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:42:09,508][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:42:09,834][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:42:10,165][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:42:10,490][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:42:10,816][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:42:11,142][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:42:11,468][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:42:11,794][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:42:12,120][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:42:12,447][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:42:12,771][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:42:13,098][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:42:13,427][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:42:13,756][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:42:14,081][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:42:14,409][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:42:14,734][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:42:15,060][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:42:15,386][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:42:15,713][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:42:16,038][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:42:16,363][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:42:16,689][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:42:17,015][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:42:17,727][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:42:18,470][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:42:18,472][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:42:18,473][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:42:20,390][__main__][INFO] - Iteration 111 took 21s (30.81% Gen, 60.38% Train). Generation: 6s, Training: 13s. Estimated remaining time: 17h 30m 1s. Estimated total time: 18h 8m 32s. Time estimates for 10 more iterations: 3m 37s, 100 more iterations: 36m 17s, 500 more iterations: 3h 1m 25s.
[2025-11-13 08:42:20,392][__main__][INFO] - Starting iteration 111.
[2025-11-13 08:42:20,396][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1.
[2025-11-13 08:42:20,396][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:42:27,559][__main__][INFO] - Number of regex retries in iteration 111: 0
[2025-11-13 08:42:27,560][__main__][INFO] - agents played in iteration 111 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:42:28,006][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:42:28,044][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:42:28,077][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:42:28,111][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:42:28,111][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:42:28,112][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:42:28,835][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:42:29,146][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:42:29,472][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:42:29,799][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:42:30,124][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:42:30,452][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:42:30,777][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:42:31,106][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:42:31,432][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:42:31,757][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:42:32,083][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:42:32,409][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:42:32,734][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:42:33,060][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:42:33,384][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:42:33,710][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:42:34,035][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:42:34,361][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:42:34,686][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:42:35,017][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:42:35,345][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:42:35,671][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:42:35,998][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:42:36,324][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:42:36,650][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:42:36,975][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:42:37,301][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:42:37,627][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:42:37,955][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:42:38,280][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:42:38,609][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:42:38,935][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:42:39,261][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:42:39,984][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:42:40,748][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:42:40,750][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:42:40,751][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:42:41,732][__main__][INFO] - Iteration 112 took 21s (33.57% Gen, 61.82% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 7m 59s. Estimated total time: 17h 46m 52s. Time estimates for 10 more iterations: 3m 33s, 100 more iterations: 35m 33s, 500 more iterations: 2h 57m 48s.
[2025-11-13 08:42:41,734][__main__][INFO] - Starting iteration 112.
[2025-11-13 08:42:41,737][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1.
[2025-11-13 08:42:41,738][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:42:48,710][__main__][INFO] - Number of regex retries in iteration 112: 0
[2025-11-13 08:42:48,711][__main__][INFO] - agents played in iteration 112 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:42:49,173][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:42:49,209][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:42:49,243][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:42:49,277][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:42:49,278][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:42:49,278][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:42:49,997][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:42:50,292][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:42:50,618][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:42:50,944][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:42:51,269][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:42:51,596][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:42:51,922][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:42:52,247][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:42:52,573][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:42:52,898][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:42:53,223][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:42:53,549][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:42:53,878][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:42:54,205][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:42:54,532][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:42:54,858][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:42:55,185][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:42:55,510][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:42:55,837][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:42:56,163][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:42:56,490][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:42:56,821][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:42:57,149][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:42:57,477][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:42:57,807][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:42:58,135][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:42:58,461][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:42:58,788][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:42:59,116][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:42:59,444][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:42:59,769][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:43:00,096][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:43:00,422][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:43:01,139][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:43:01,879][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:43:01,881][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:43:01,882][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:43:02,853][__main__][INFO] - Iteration 113 took 21s (33.02% Gen, 62.38% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 56m 35s. Estimated total time: 17h 35m 48s. Time estimates for 10 more iterations: 3m 31s, 100 more iterations: 35m 11s, 500 more iterations: 2h 55m 58s.
[2025-11-13 08:43:02,855][__main__][INFO] - Starting iteration 113.
[2025-11-13 08:43:02,858][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1.
[2025-11-13 08:43:02,859][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:43:09,773][__main__][INFO] - Number of regex retries in iteration 113: 0
[2025-11-13 08:43:09,774][__main__][INFO] - agents played in iteration 113 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:43:10,214][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:43:10,248][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:43:10,280][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:43:10,314][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:43:10,314][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:43:10,314][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:43:11,039][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:43:11,335][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:43:11,663][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:43:11,992][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:43:12,318][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:43:12,645][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:43:12,975][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:43:13,300][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:43:13,628][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:43:13,954][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:43:14,285][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:43:14,615][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:43:14,944][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:43:15,272][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:43:15,600][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:43:15,930][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:43:16,258][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:43:16,585][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:43:16,912][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:43:17,240][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:43:17,567][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:43:17,894][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:43:18,224][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:43:18,549][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:43:18,877][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:43:19,202][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:43:19,528][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:43:19,856][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:43:20,181][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:43:20,507][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:43:20,832][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:43:21,157][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:43:21,483][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:43:22,206][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:43:22,971][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:43:22,973][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:43:22,975][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:43:23,996][__main__][INFO] - Iteration 114 took 21s (32.71% Gen, 62.45% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 57m 21s. Estimated total time: 17h 36m 55s. Time estimates for 10 more iterations: 3m 31s, 100 more iterations: 35m 13s, 500 more iterations: 2h 56m 9s.
[2025-11-13 08:43:23,998][__main__][INFO] - Starting iteration 114.
[2025-11-13 08:43:24,000][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1.
[2025-11-13 08:43:24,001][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:43:30,879][__main__][INFO] - Number of regex retries in iteration 114: 0
[2025-11-13 08:43:30,880][__main__][INFO] - agents played in iteration 114 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:43:31,329][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:43:31,363][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:43:31,397][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:43:31,431][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:43:31,431][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:43:31,431][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:43:32,153][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:43:32,447][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:43:32,774][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:43:33,100][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:43:33,425][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:43:33,755][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:43:34,083][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:43:34,409][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:43:34,734][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:43:35,061][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:43:35,386][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:43:35,713][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:43:36,040][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:43:36,369][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:43:36,694][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:43:37,022][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:43:37,347][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:43:37,674][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:43:38,000][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:43:38,325][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:43:38,653][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:43:38,979][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:43:39,310][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:43:39,638][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:43:39,964][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:43:40,293][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:43:40,619][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:43:40,946][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:43:41,276][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:43:41,607][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:43:41,932][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:43:42,258][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:43:42,586][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:43:43,316][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:43:44,057][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:43:44,058][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:43:44,060][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:43:45,064][__main__][INFO] - Iteration 115 took 21s (32.66% Gen, 62.57% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 53m 18s. Estimated total time: 17h 33m 14s. Time estimates for 10 more iterations: 3m 30s, 100 more iterations: 35m 6s, 500 more iterations: 2h 55m 32s.
[2025-11-13 08:43:45,066][__main__][INFO] - Starting iteration 115.
[2025-11-13 08:43:45,069][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1.
[2025-11-13 08:43:45,070][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:43:51,954][__main__][INFO] - Number of regex retries in iteration 115: 0
[2025-11-13 08:43:51,955][__main__][INFO] - agents played in iteration 115 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:43:52,390][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:43:52,424][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:43:52,459][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:43:52,493][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:43:52,493][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:43:52,494][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:43:53,230][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:43:53,527][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:43:53,854][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:43:54,182][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:43:54,513][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:43:54,838][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:43:55,163][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:43:55,488][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:43:55,814][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:43:56,141][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:43:56,466][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:43:56,792][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:43:57,117][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:43:57,443][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:43:57,771][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:43:58,099][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:43:58,425][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:43:58,751][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:43:59,076][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:43:59,404][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:43:59,732][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:44:00,062][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:44:00,394][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:44:00,721][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:44:01,052][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:44:01,379][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:44:01,706][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:44:02,032][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:44:02,357][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:44:02,683][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:44:03,009][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:44:03,336][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:44:03,664][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:44:04,377][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:44:05,124][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:44:05,126][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:44:05,127][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:44:06,089][__main__][INFO] - Iteration 116 took 21s (32.75% Gen, 62.67% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 50m 44s. Estimated total time: 17h 31m 1s. Time estimates for 10 more iterations: 3m 30s, 100 more iterations: 35m 2s, 500 more iterations: 2h 55m 10s.
[2025-11-13 08:44:06,091][__main__][INFO] - Starting iteration 116.
[2025-11-13 08:44:06,094][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1.
[2025-11-13 08:44:06,094][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:44:12,922][__main__][INFO] - Number of regex retries in iteration 116: 0
[2025-11-13 08:44:12,923][__main__][INFO] - agents played in iteration 116 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:44:13,364][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:44:13,398][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:44:13,432][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:44:13,466][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:44:13,467][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:44:13,467][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:44:14,186][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:44:14,482][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:44:14,810][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:44:15,135][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:44:15,460][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:44:15,787][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:44:16,116][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:44:16,444][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:44:16,775][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:44:17,103][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:44:17,431][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:44:17,758][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:44:18,084][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:44:18,412][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:44:18,737][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:44:19,063][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:44:19,388][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:44:19,715][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:44:20,041][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:44:20,367][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:44:20,697][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:44:21,026][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:44:21,354][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:44:21,681][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:44:22,007][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:44:22,333][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:44:22,659][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:44:22,985][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:44:23,314][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:44:23,640][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:44:23,971][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:44:24,300][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:44:24,629][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:44:25,362][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:44:26,108][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:44:26,110][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:44:26,112][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:44:27,098][__main__][INFO] - Iteration 117 took 21s (32.51% Gen, 62.79% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 49m 38s. Estimated total time: 17h 30m 15s. Time estimates for 10 more iterations: 3m 30s, 100 more iterations: 35m 0s, 500 more iterations: 2h 55m 2s.
[2025-11-13 08:44:27,100][__main__][INFO] - Starting iteration 117.
[2025-11-13 08:44:27,103][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1.
[2025-11-13 08:44:27,104][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:44:33,949][__main__][INFO] - Number of regex retries in iteration 117: 0
[2025-11-13 08:44:33,950][__main__][INFO] - agents played in iteration 117 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:44:34,391][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:44:34,428][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:44:34,462][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:44:34,497][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:44:34,498][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:44:34,498][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:44:35,218][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:44:35,515][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:44:35,843][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:44:36,168][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:44:36,495][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:44:36,823][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:44:37,149][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:44:37,474][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:44:37,800][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:44:38,128][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:44:38,453][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:44:38,779][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:44:39,109][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:44:39,437][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:44:39,765][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:44:40,094][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:44:40,419][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:44:40,747][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:44:41,075][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:44:41,406][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:44:41,733][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:44:42,064][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:44:42,393][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:44:42,721][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:44:43,053][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:44:43,379][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:44:43,706][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:44:44,032][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:44:44,358][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:44:44,683][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:44:45,009][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:44:45,335][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:44:45,661][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:44:46,378][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:44:47,111][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:44:47,113][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:44:47,115][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:44:48,063][__main__][INFO] - Iteration 118 took 20s (32.66% Gen, 62.81% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 47m 4s. Estimated total time: 17h 28m 3s. Time estimates for 10 more iterations: 3m 29s, 100 more iterations: 34m 56s, 500 more iterations: 2h 54m 40s.
[2025-11-13 08:44:48,065][__main__][INFO] - Starting iteration 118.
[2025-11-13 08:44:48,068][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1.
[2025-11-13 08:44:48,068][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:44:54,983][__main__][INFO] - Number of regex retries in iteration 118: 0
[2025-11-13 08:44:54,983][__main__][INFO] - agents played in iteration 118 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:44:55,442][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:44:55,475][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:44:55,509][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:44:55,544][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:44:55,544][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:44:55,545][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:44:56,257][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:44:56,553][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:44:56,880][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:44:57,205][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:44:57,532][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:44:57,862][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:44:58,189][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:44:58,519][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:44:58,847][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:44:59,173][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:44:59,501][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:44:59,831][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:45:00,163][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:45:00,488][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:45:00,813][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:45:01,140][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:45:01,471][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:45:01,802][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:45:02,133][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:45:02,460][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:45:02,787][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:45:03,112][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:45:03,439][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:45:03,764][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:45:04,089][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:45:04,415][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:45:04,742][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:45:05,068][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:45:05,392][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:45:05,720][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:45:06,045][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:45:06,373][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:45:06,699][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:45:07,420][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:45:08,168][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:45:08,169][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:45:08,171][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:45:09,118][__main__][INFO] - Iteration 119 took 21s (32.85% Gen, 62.65% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 51m 13s. Estimated total time: 17h 32m 32s. Time estimates for 10 more iterations: 3m 30s, 100 more iterations: 35m 5s, 500 more iterations: 2h 55m 25s.
[2025-11-13 08:45:09,120][__main__][INFO] - Starting iteration 119.
[2025-11-13 08:45:09,122][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1.
[2025-11-13 08:45:09,123][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:45:16,018][__main__][INFO] - Number of regex retries in iteration 119: 0
[2025-11-13 08:45:16,018][__main__][INFO] - agents played in iteration 119 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:45:16,488][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:45:16,522][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:45:16,556][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:45:16,590][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:45:16,591][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:45:16,591][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:45:17,291][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:45:17,587][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:45:17,915][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:45:18,241][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:45:18,574][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:45:18,904][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:45:19,230][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:45:19,554][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:45:19,881][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:45:20,206][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:45:20,532][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:45:20,858][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:45:21,184][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:45:21,510][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:45:21,837][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:45:22,164][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:45:22,488][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:45:22,815][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:45:23,140][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:45:23,468][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:45:23,793][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:45:24,118][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:45:24,445][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:45:24,775][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:45:25,100][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:45:25,426][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:45:25,752][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:45:26,077][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:45:26,404][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:45:26,730][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:45:27,056][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:45:27,380][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:45:27,707][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:45:28,413][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:45:29,142][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:45:29,143][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:45:29,145][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:45:30,196][__main__][INFO] - Iteration 120 took 21s (32.72% Gen, 62.28% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 52m 3s. Estimated total time: 17h 33m 44s. Time estimates for 10 more iterations: 3m 30s, 100 more iterations: 35m 7s, 500 more iterations: 2h 55m 37s.
[2025-11-13 08:45:30,198][__main__][INFO] - Starting iteration 120.
[2025-11-13 08:45:30,201][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1.
[2025-11-13 08:45:30,201][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:45:36,955][__main__][INFO] - Number of regex retries in iteration 120: 0
[2025-11-13 08:45:36,955][__main__][INFO] - agents played in iteration 120 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:45:37,403][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:45:37,438][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:45:37,471][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:45:37,504][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:45:37,505][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:45:37,505][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:45:38,230][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:45:38,527][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:45:38,853][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:45:39,180][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:45:39,504][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:45:39,829][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:45:40,156][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:45:40,482][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:45:40,808][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:45:41,133][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:45:41,460][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:45:41,786][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:45:42,111][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:45:42,436][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:45:42,764][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:45:43,092][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:45:43,417][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:45:43,745][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:45:44,071][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:45:44,400][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:45:44,727][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:45:45,053][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:45:45,378][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:45:45,704][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:45:46,032][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:45:46,358][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:45:46,683][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:45:47,012][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:45:47,337][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:45:47,663][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:45:47,989][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:45:48,316][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:45:48,643][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:45:49,362][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:45:50,106][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:45:50,108][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:45:50,109][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:45:52,050][__main__][INFO] - Iteration 121 took 21s (30.91% Gen, 60.20% Train). Generation: 6s, Training: 13s. Estimated remaining time: 17h 30m 27s. Estimated total time: 18h 12m 30s. Time estimates for 10 more iterations: 3m 38s, 100 more iterations: 36m 25s, 500 more iterations: 3h 2m 5s.
[2025-11-13 08:45:52,052][__main__][INFO] - Starting iteration 121.
[2025-11-13 08:45:52,055][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1.
[2025-11-13 08:45:52,056][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:45:59,483][__main__][INFO] - Number of regex retries in iteration 121: 0
[2025-11-13 08:45:59,483][__main__][INFO] - agents played in iteration 121 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:45:59,930][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:45:59,966][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:46:00,000][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:46:00,033][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:46:00,034][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:46:00,034][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:46:00,759][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:46:01,055][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:46:01,382][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:46:01,707][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:46:02,033][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:46:02,358][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:46:02,683][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:46:03,009][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:46:03,334][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:46:03,659][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:46:03,986][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:46:04,311][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:46:04,638][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:46:04,966][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:46:05,291][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:46:05,617][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:46:05,943][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:46:06,269][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:46:06,601][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:46:06,926][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:46:07,252][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:46:07,578][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:46:07,904][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:46:08,230][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:46:08,555][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:46:08,881][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:46:09,207][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:46:09,532][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:46:09,858][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:46:10,184][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:46:10,510][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:46:10,834][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:46:11,159][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:46:11,874][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:46:12,625][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:46:12,626][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:46:12,628][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:46:13,618][__main__][INFO] - Iteration 122 took 21s (34.45% Gen, 60.96% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 15m 45s. Estimated total time: 17h 58m 10s. Time estimates for 10 more iterations: 3m 35s, 100 more iterations: 35m 56s, 500 more iterations: 2h 59m 41s.
[2025-11-13 08:46:13,620][__main__][INFO] - Starting iteration 122.
[2025-11-13 08:46:13,623][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1.
[2025-11-13 08:46:13,624][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:46:21,031][__main__][INFO] - Number of regex retries in iteration 122: 0
[2025-11-13 08:46:21,032][__main__][INFO] - agents played in iteration 122 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:46:21,474][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:46:21,507][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:46:21,541][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:46:21,575][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:46:21,575][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:46:21,576][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:46:22,307][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:46:22,603][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:46:22,933][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:46:23,261][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:46:23,592][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:46:23,919][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:46:24,248][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:46:24,574][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:46:24,905][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:46:25,238][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:46:25,567][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:46:25,894][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:46:26,220][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:46:26,551][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:46:26,877][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:46:27,202][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:46:27,529][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:46:27,855][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:46:28,181][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:46:28,507][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:46:28,835][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:46:29,167][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:46:29,493][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:46:29,819][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:46:30,147][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:46:30,473][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:46:30,799][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:46:31,125][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:46:31,452][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:46:31,777][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:46:32,103][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:46:32,429][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:46:32,756][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:46:33,485][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:46:34,236][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:46:34,238][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:46:34,240][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:46:35,202][__main__][INFO] - Iteration 123 took 21s (34.33% Gen, 61.21% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 16m 11s. Estimated total time: 17h 58m 57s. Time estimates for 10 more iterations: 3m 35s, 100 more iterations: 35m 57s, 500 more iterations: 2h 59m 49s.
[2025-11-13 08:46:35,204][__main__][INFO] - Starting iteration 123.
[2025-11-13 08:46:35,207][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1.
[2025-11-13 08:46:35,207][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:46:42,540][__main__][INFO] - Number of regex retries in iteration 123: 0
[2025-11-13 08:46:42,541][__main__][INFO] - agents played in iteration 123 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:46:42,978][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:46:43,013][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:46:43,047][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:46:43,081][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:46:43,082][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:46:43,082][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:46:43,791][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:46:44,087][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:46:44,418][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:46:44,743][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:46:45,072][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:46:45,400][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:46:45,726][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:46:46,054][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:46:46,380][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:46:46,707][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:46:47,032][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:46:47,359][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:46:47,684][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:46:48,012][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:46:48,338][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:46:48,665][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:46:48,991][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:46:49,316][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:46:49,642][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:46:49,968][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:46:50,295][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:46:50,620][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:46:50,946][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:46:51,275][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:46:51,601][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:46:51,927][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:46:52,255][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:46:52,581][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:46:52,908][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:46:53,234][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:46:53,560][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:46:53,887][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:46:54,214][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:46:54,927][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:46:55,666][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:46:55,667][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:46:55,669][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:46:56,624][__main__][INFO] - Iteration 124 took 21s (34.24% Gen, 61.29% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 7m 47s. Estimated total time: 17h 50m 54s. Time estimates for 10 more iterations: 3m 34s, 100 more iterations: 35m 41s, 500 more iterations: 2h 58m 29s.
[2025-11-13 08:46:56,626][__main__][INFO] - Starting iteration 124.
[2025-11-13 08:46:56,629][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1.
[2025-11-13 08:46:56,630][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:47:03,769][__main__][INFO] - Number of regex retries in iteration 124: 0
[2025-11-13 08:47:03,770][__main__][INFO] - agents played in iteration 124 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:47:04,215][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:47:04,248][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:47:04,281][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:47:04,315][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:47:04,315][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:47:04,316][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:47:05,052][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:47:05,348][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:47:05,675][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:47:06,005][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:47:06,330][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:47:06,656][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:47:06,982][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:47:07,307][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:47:07,632][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:47:07,957][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:47:08,283][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:47:08,608][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:47:08,937][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:47:09,262][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:47:09,587][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:47:09,913][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:47:10,239][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:47:10,564][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:47:10,892][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:47:11,217][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:47:11,543][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:47:11,868][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:47:12,193][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:47:12,518][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:47:12,843][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:47:13,169][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:47:13,495][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:47:13,821][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:47:14,146][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:47:14,471][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:47:14,797][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:47:15,122][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:47:15,448][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:47:16,163][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:47:16,906][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:47:16,908][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:47:16,909][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:47:17,892][__main__][INFO] - Iteration 125 took 21s (33.58% Gen, 61.79% Train). Generation: 7s, Training: 13s. Estimated remaining time: 16h 59m 44s. Estimated total time: 17h 43m 12s. Time estimates for 10 more iterations: 3m 32s, 100 more iterations: 35m 26s, 500 more iterations: 2h 57m 12s.
[2025-11-13 08:47:17,895][__main__][INFO] - Starting iteration 125.
[2025-11-13 08:47:17,898][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1.
[2025-11-13 08:47:17,899][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:47:25,354][__main__][INFO] - Number of regex retries in iteration 125: 0
[2025-11-13 08:47:25,354][__main__][INFO] - agents played in iteration 125 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:47:25,820][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:47:25,857][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:47:25,892][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:47:25,926][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:47:25,927][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:47:25,927][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:47:26,641][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:47:26,937][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:47:27,263][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:47:27,590][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:47:27,918][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:47:28,244][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:47:28,571][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:47:28,898][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:47:29,225][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:47:29,551][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:47:29,876][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:47:30,202][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:47:30,528][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:47:30,853][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:47:31,179][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:47:31,506][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:47:31,831][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:47:32,157][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:47:32,483][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:47:32,810][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:47:33,136][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:47:33,463][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:47:33,790][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:47:34,116][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:47:34,441][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:47:34,766][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:47:35,092][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:47:35,417][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:47:35,743][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:47:36,068][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:47:36,393][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:47:36,720][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:47:37,048][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:47:37,791][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:47:38,619][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:47:38,621][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:47:38,623][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:47:39,585][__main__][INFO] - Iteration 126 took 21s (34.38% Gen, 61.18% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 20m 33s. Estimated total time: 18h 4m 23s. Time estimates for 10 more iterations: 3m 36s, 100 more iterations: 36m 8s, 500 more iterations: 3h 0m 43s.
[2025-11-13 08:47:39,587][__main__][INFO] - Starting iteration 126.
[2025-11-13 08:47:39,590][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1.
[2025-11-13 08:47:39,590][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:47:46,650][__main__][INFO] - Number of regex retries in iteration 126: 0 [2025-11-13 08:47:46,650][__main__][INFO] - agents played in iteration 126 are Alice_buffer, Alice, Bob, Bob_buffer [2025-11-13 08:47:47,099][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:47:47,136][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:47:47,170][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:47:47,204][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:47:47,205][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:47:47,205][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:47:47,916][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:47:48,213][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:47:48,538][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:47:48,866][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:47:49,194][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:47:49,520][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:47:49,852][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:47:50,180][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:47:50,507][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:47:50,834][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:47:51,160][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:47:51,487][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:47:51,812][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:47:52,137][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:47:52,463][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:47:52,789][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:47:53,113][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:47:53,439][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:47:53,765][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:47:54,090][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:47:54,416][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 08:47:54,747][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:47:55,073][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:47:55,405][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:47:55,732][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:47:56,058][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:47:56,384][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:47:56,710][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:47:57,035][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:47:57,361][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:47:57,687][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:47:58,017][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:47:58,347][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 08:47:59,060][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:47:59,799][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:47:59,800][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:47:59,802][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:48:00,737][__main__][INFO] - Iteration 127 took 21s (33.38% Gen, 62.19% Train). Generation: 7s, Training: 13s. Estimated remaining time: 16h 53m 13s. Estimated total time: 17h 37m 24s. Time estimates for 10 more iterations: 3m 31s, 100 more iterations: 35m 14s, 500 more iterations: 2h 56m 14s.
[2025-11-13 08:48:00,740][__main__][INFO] - Starting iteration 127.
[2025-11-13 08:48:00,742][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1.
[2025-11-13 08:48:00,743][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:48:07,814][__main__][INFO] - Number of regex retries in iteration 127: 0
[2025-11-13 08:48:07,815][__main__][INFO] - agents played in iteration 127 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:48:08,258][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:48:08,296][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:48:08,331][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:48:08,365][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:48:08,366][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:48:08,366][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:48:09,081][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:48:09,376][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:48:09,704][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:48:10,030][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:48:10,359][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:48:10,685][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:48:11,011][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:48:11,339][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:48:11,665][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:48:11,991][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:48:12,318][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:48:12,644][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:48:12,971][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:48:13,298][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:48:13,626][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:48:13,952][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:48:14,280][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:48:14,606][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:48:14,931][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:48:15,259][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:48:15,585][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:48:15,911][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:48:16,237][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:48:16,567][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:48:16,894][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:48:17,223][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:48:17,550][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:48:17,875][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:48:18,201][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:48:18,527][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:48:18,852][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:48:19,178][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:48:19,505][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:48:20,213][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:48:20,940][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:48:20,942][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:48:20,943][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:48:21,948][__main__][INFO] - Iteration 128 took 21s (33.35% Gen, 61.91% Train). Generation: 7s, Training: 13s. Estimated remaining time: 16h 55m 47s. Estimated total time: 17h 40m 20s. Time estimates for 10 more iterations: 3m 32s, 100 more iterations: 35m 20s, 500 more iterations: 2h 56m 43s.
[2025-11-13 08:48:21,950][__main__][INFO] - Starting iteration 128.
[2025-11-13 08:48:21,953][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1.
[2025-11-13 08:48:21,954][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:48:29,312][__main__][INFO] - Number of regex retries in iteration 128: 0
[2025-11-13 08:48:29,312][__main__][INFO] - agents played in iteration 128 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:48:29,751][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:48:29,786][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:48:29,820][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:48:29,854][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:48:29,855][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:48:29,855][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:48:30,564][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:48:30,860][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:48:31,185][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:48:31,512][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:48:31,838][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:48:32,163][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:48:32,489][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:48:32,815][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:48:33,139][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:48:33,464][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:48:33,790][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:48:34,115][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:48:34,440][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:48:34,766][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:48:35,091][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:48:35,418][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:48:35,743][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:48:36,070][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:48:36,397][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:48:36,724][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:48:37,054][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:48:37,385][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:48:37,712][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:48:38,039][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:48:38,365][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:48:38,694][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:48:39,019][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:48:39,348][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:48:39,675][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:48:40,007][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:48:40,332][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:48:40,664][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:48:40,990][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:48:41,707][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:48:42,442][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:48:42,443][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:48:42,445][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:48:43,394][__main__][INFO] - Iteration 129 took 21s (34.32% Gen, 61.25% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 7m 9s. Estimated total time: 17h 52m 4s. Time estimates for 10 more iterations: 3m 34s, 100 more iterations: 35m 44s, 500 more iterations: 2h 58m 40s.
[2025-11-13 08:48:43,395][__main__][INFO] - Starting iteration 129.
[2025-11-13 08:48:43,398][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1.
[2025-11-13 08:48:43,399][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:48:50,698][__main__][INFO] - Number of regex retries in iteration 129: 0
[2025-11-13 08:48:50,699][__main__][INFO] - agents played in iteration 129 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:48:51,143][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:48:51,180][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:48:51,214][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:48:51,248][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:48:51,249][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:48:51,249][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:48:51,952][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:48:52,248][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:48:52,574][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:48:52,900][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:48:53,225][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:48:53,550][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:48:53,876][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:48:54,201][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:48:54,527][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:48:54,853][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:48:55,179][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:48:55,504][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:48:55,829][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:48:56,155][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:48:56,483][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:48:56,809][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:48:57,139][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:48:57,465][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:48:57,791][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:48:58,117][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:48:58,443][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:48:58,768][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:48:59,094][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:48:59,422][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:48:59,749][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:49:00,074][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:49:00,399][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:49:00,725][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:49:01,051][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:49:01,377][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:49:01,705][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:49:02,031][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:49:02,357][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:49:03,076][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:49:03,825][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:49:03,826][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:49:03,828][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:49:04,875][__main__][INFO] - Iteration 130 took 21s (33.99% Gen, 61.13% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 8m 39s. Estimated total time: 17h 53m 55s. Time estimates for 10 more iterations: 3m 34s, 100 more iterations: 35m 47s, 500 more iterations: 2h 58m 59s.
[2025-11-13 08:49:04,878][__main__][INFO] - Starting iteration 130.
[2025-11-13 08:49:04,880][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1.
[2025-11-13 08:49:04,881][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:49:12,051][__main__][INFO] - Number of regex retries in iteration 130: 0
[2025-11-13 08:49:12,052][__main__][INFO] - agents played in iteration 130 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:49:12,496][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:49:12,532][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:49:12,566][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:49:12,600][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:49:12,601][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:49:12,601][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:49:13,298][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:49:13,594][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:49:13,920][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:49:14,246][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:49:14,571][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:49:14,899][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:49:15,225][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:49:15,555][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:49:15,884][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:49:16,212][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:49:16,539][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:49:16,866][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:49:17,193][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:49:17,522][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:49:17,848][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:49:18,175][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:49:18,501][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:49:18,829][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:49:19,154][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:49:19,481][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:49:19,806][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:49:20,133][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:49:20,458][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:49:20,784][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:49:21,110][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:49:21,442][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:49:21,774][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:49:22,099][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:49:22,425][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:49:22,755][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:49:23,082][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:49:23,409][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:49:23,736][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:49:24,447][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:49:25,183][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:49:25,185][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:49:25,186][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:49:27,059][__main__][INFO] - Iteration 131 took 22s (32.33% Gen, 59.22% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 43m 19s. Estimated total time: 18h 28m 57s. Time estimates for 10 more iterations: 3m 41s, 100 more iterations: 36m 57s, 500 more iterations: 3h 4m 49s. [2025-11-13 08:49:27,060][__main__][INFO] - Starting iteration 131. [2025-11-13 08:49:27,063][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2025-11-13 08:49:27,064][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:49:34,708][__main__][INFO] - Number of regex retries in iteration 131: 0 [2025-11-13 08:49:34,708][__main__][INFO] - agents played in iteration 131 are Alice_buffer, Alice, Bob, Bob_buffer [2025-11-13 08:49:35,158][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:49:35,191][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:49:35,224][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:49:35,258][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:49:35,258][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:49:35,259][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:49:35,977][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:49:36,274][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:49:36,604][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:49:36,931][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:49:37,257][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:49:37,584][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:49:37,909][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:49:38,234][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:49:38,560][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:49:38,891][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:49:39,217][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:49:39,545][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:49:39,875][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:49:40,202][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:49:40,529][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:49:40,855][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:49:41,182][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:49:41,510][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:49:41,836][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:49:42,161][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:49:42,487][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:49:42,814][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:49:43,141][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:49:43,466][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:49:43,792][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:49:44,120][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:49:44,446][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:49:44,770][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:49:45,096][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:49:45,423][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:49:45,749][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:49:46,074][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:49:46,403][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
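The run above accumulates per-mini-batch policy-gradient losses across all 128 mini-batches before the single "Apply reinforce step" that follows. A framework-free sketch of that accumulation pattern, under the assumption that each mini-batch yields a scalar loss and a token count (the `loss_fn` signature and names are illustrative, not the repository's API):

```python
def accumulate_policy_gradient(mini_batches, loss_fn, log_every=4, log_fn=print):
    """Accumulate a scalar loss and token count over mini-batches.

    Mirrors the logging cadence above: progress every `log_every`
    mini-batches, one summary line at the end. In the real trainer each
    `loss_fn` call would also run backward() to accumulate gradients;
    here we only sum scalars for illustration.
    """
    total_loss = 0.0
    total_tokens = 0
    n = len(mini_batches)
    for i, mb in enumerate(mini_batches):
        if i % log_every == 0:
            log_fn(f"Processing mini-batch {i} of {n}")
        loss, n_tokens = loss_fn(mb)
        total_loss += loss
        total_tokens += n_tokens
    log_fn(f"Accumulated the policy gradient loss for {total_tokens} tokens.")
    return total_loss / max(total_tokens, 1)
```

With 128 mini-batches of 30 action tokens each, this reproduces the 3840-token total seen in the summary line; the single optimizer step is then applied once on the accumulated gradient.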
[2025-11-13 08:49:47,116][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:49:47,861][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:49:47,862][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:49:47,864][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:49:48,839][__main__][INFO] - Iteration 132 took 21s (35.10% Gen, 60.41% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 22m 51s. Estimated total time: 18h 8m 51s. Time estimates for 10 more iterations: 3m 37s, 100 more iterations: 36m 17s, 500 more iterations: 3h 1m 28s.
[2025-11-13 08:49:48,841][__main__][INFO] - Starting iteration 132.
[2025-11-13 08:49:48,844][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1.
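The iteration summary ("Iteration 132 took 21s … Estimated remaining time: 17h 22m 51s …") can be derived from a running average of iteration durations. A rough sketch, assuming the estimator is simply mean duration times iterations left (the actual trainer may weight iterations differently, so the numbers below will not reproduce the log exactly):

```python
def format_duration(seconds):
    """Render seconds in the '17h 22m 51s' style used by the log."""
    seconds = int(seconds)
    h, rem = divmod(seconds, 3600)
    m, s = divmod(rem, 60)
    if h:
        return f"{h}h {m}m {s}s"
    if m:
        return f"{m}m {s}s"
    return f"{s}s"


def eta_summary(iteration_durations, total_iterations):
    """Estimate remaining time from the mean iteration duration so far."""
    done = len(iteration_durations)
    mean = sum(iteration_durations) / done
    remaining = mean * (total_iterations - done)
    return (
        f"Estimated remaining time: {format_duration(remaining)}. "
        f"Time estimates for 10 more iterations: {format_duration(mean * 10)}."
    )
```

At ~21 s per iteration the 10-iteration estimate comes out near the "3m 37s" in the log, with small drift as the mean updates each iteration.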
[2025-11-13 08:49:48,845][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:49:56,470][__main__][INFO] - Number of regex retries in iteration 132: 0
[2025-11-13 08:49:56,471][__main__][INFO] - agents played in iteration 132 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:49:56,913][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:49:56,947][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:49:56,981][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:49:57,015][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:49:57,015][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:49:57,016][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:49:57,731][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:49:58,026][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:49:58,352][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:49:58,678][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:49:59,003][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:49:59,330][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:49:59,654][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:49:59,980][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:50:00,305][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:50:00,630][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:50:00,955][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:50:01,281][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:50:01,607][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:50:01,931][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:50:02,258][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:50:02,585][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:50:02,913][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:50:03,244][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:50:03,571][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:50:03,899][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:50:04,227][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:50:04,557][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:50:04,882][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:50:05,208][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:50:05,534][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:50:05,864][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:50:06,190][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:50:06,516][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:50:06,841][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:50:07,167][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:50:07,491][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:50:07,818][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:50:08,144][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:50:08,868][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:50:09,625][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:50:09,627][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:50:09,628][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:50:10,593][__main__][INFO] - Iteration 133 took 21s (35.06% Gen, 60.50% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 21m 6s. Estimated total time: 18h 7m 27s. Time estimates for 10 more iterations: 3m 37s, 100 more iterations: 36m 14s, 500 more iterations: 3h 1m 14s.
[2025-11-13 08:50:10,595][__main__][INFO] - Starting iteration 133.
[2025-11-13 08:50:10,597][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1.
[2025-11-13 08:50:10,598][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:50:18,103][__main__][INFO] - Number of regex retries in iteration 133: 0
[2025-11-13 08:50:18,104][__main__][INFO] - agents played in iteration 133 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:50:18,561][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:50:18,596][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:50:18,630][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:50:18,664][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:50:18,664][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:50:18,665][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:50:19,393][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:50:19,690][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:50:20,017][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:50:20,348][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:50:20,677][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:50:21,005][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:50:21,330][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:50:21,656][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:50:21,985][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:50:22,312][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:50:22,639][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:50:22,965][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:50:23,293][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:50:23,621][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:50:23,948][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:50:24,275][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:50:24,601][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:50:24,928][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:50:25,254][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:50:25,578][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:50:25,904][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:50:26,230][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:50:26,559][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:50:26,885][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:50:27,212][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:50:27,539][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:50:27,865][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:50:28,192][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:50:28,517][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:50:28,844][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:50:29,172][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:50:29,499][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:50:29,828][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:50:30,550][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:50:31,292][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:50:31,293][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:50:31,295][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:50:32,262][__main__][INFO] - Iteration 134 took 21s (34.64% Gen, 60.88% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 16m 33s. Estimated total time: 18h 3m 16s. Time estimates for 10 more iterations: 3m 36s, 100 more iterations: 36m 6s, 500 more iterations: 3h 0m 32s.
[2025-11-13 08:50:32,264][__main__][INFO] - Starting iteration 134.
[2025-11-13 08:50:32,266][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1.
[2025-11-13 08:50:32,267][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:50:39,852][__main__][INFO] - Number of regex retries in iteration 134: 0
[2025-11-13 08:50:39,852][__main__][INFO] - agents played in iteration 134 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:50:40,297][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:50:40,335][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:50:40,369][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:50:40,404][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:50:40,404][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:50:40,405][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:50:41,121][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:50:41,417][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:50:41,745][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:50:42,071][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:50:42,396][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:50:42,721][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:50:43,046][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:50:43,372][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:50:43,698][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:50:44,023][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:50:44,348][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:50:44,673][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:50:44,999][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:50:45,323][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:50:45,650][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:50:45,975][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:50:46,300][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:50:46,626][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:50:46,951][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:50:47,277][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:50:47,603][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:50:47,928][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:50:48,252][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:50:48,581][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:50:48,907][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:50:49,233][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:50:49,558][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:50:49,883][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:50:50,208][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:50:50,533][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:50:50,859][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:50:51,185][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:50:51,513][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:50:52,231][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:50:52,965][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:50:52,966][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:50:52,968][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:50:53,910][__main__][INFO] - Iteration 135 took 21s (35.04% Gen, 60.60% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 15m 8s. Estimated total time: 18h 2m 13s. Time estimates for 10 more iterations: 3m 36s, 100 more iterations: 36m 4s, 500 more iterations: 3h 0m 22s.
[2025-11-13 08:50:53,912][__main__][INFO] - Starting iteration 135.
[2025-11-13 08:50:53,914][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1.
[2025-11-13 08:50:53,915][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:51:01,310][__main__][INFO] - Number of regex retries in iteration 135: 0
[2025-11-13 08:51:01,311][__main__][INFO] - agents played in iteration 135 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:51:01,752][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:51:01,787][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:51:01,821][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:51:01,855][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:51:01,856][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:51:01,856][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:51:02,576][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:51:02,871][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:51:03,201][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:51:03,528][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:51:03,855][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:51:04,185][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:51:04,511][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:51:04,837][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:51:05,164][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:51:05,492][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:51:05,823][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:51:06,156][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:51:06,488][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:51:06,819][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:51:07,145][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:51:07,471][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:51:07,798][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:51:08,125][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:51:08,449][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:51:08,775][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:51:09,103][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:51:09,428][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:51:09,754][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:51:10,081][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:51:10,406][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:51:10,733][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:51:11,059][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:51:11,384][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:51:11,712][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:51:12,038][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:51:12,369][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:51:12,695][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:51:13,022][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:51:13,749][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:51:14,509][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:51:14,511][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:51:14,513][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:51:15,458][__main__][INFO] - Iteration 136 took 21s (34.33% Gen, 61.28% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 9m 46s. Estimated total time: 17h 57m 12s. Time estimates for 10 more iterations: 3m 35s, 100 more iterations: 35m 54s, 500 more iterations: 2h 59m 32s.
[2025-11-13 08:51:15,460][__main__][INFO] - Starting iteration 136.
[2025-11-13 08:51:15,462][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1.
[2025-11-13 08:51:15,463][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:51:22,897][__main__][INFO] - Number of regex retries in iteration 136: 0
[2025-11-13 08:51:22,898][__main__][INFO] - agents played in iteration 136 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:51:23,339][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:51:23,374][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:51:23,408][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:51:23,442][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:51:23,443][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:51:23,443][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:51:24,158][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:51:24,454][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:51:24,781][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:51:25,108][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:51:25,435][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:51:25,760][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:51:26,087][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:51:26,411][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:51:26,736][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:51:27,064][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:51:27,389][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:51:27,715][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:51:28,042][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:51:28,367][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:51:28,694][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:51:29,022][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:51:29,347][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:51:29,673][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:51:29,998][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:51:30,323][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:51:30,648][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:51:30,974][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:51:31,301][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:51:31,626][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:51:31,951][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:51:32,277][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:51:32,602][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:51:32,928][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:51:33,253][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:51:33,578][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:51:33,904][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:51:34,228][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:51:34,554][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:51:35,264][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:51:36,002][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:51:36,003][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:51:36,005][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:51:36,956][__main__][INFO] - Iteration 137 took 21s (34.59% Gen, 60.98% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 6m 56s. Estimated total time: 17h 54m 44s. Time estimates for 10 more iterations: 3m 34s, 100 more iterations: 35m 49s, 500 more iterations: 2h 59m 7s.
[2025-11-13 08:51:36,958][__main__][INFO] - Starting iteration 137.
[2025-11-13 08:51:36,961][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1.
[2025-11-13 08:51:36,962][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:51:44,419][__main__][INFO] - Number of regex retries in iteration 137: 0
[2025-11-13 08:51:44,419][__main__][INFO] - agents played in iteration 137 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:51:44,865][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:51:44,898][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:51:44,932][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:51:44,966][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:51:44,967][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:51:44,967][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:51:45,684][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:51:45,979][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:51:46,305][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:51:46,633][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:51:46,962][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:51:47,289][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:51:47,616][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:51:47,942][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:51:48,270][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:51:48,597][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:51:48,921][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:51:49,248][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:51:49,575][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:51:49,901][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:51:50,226][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:51:50,551][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:51:50,878][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:51:51,203][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:51:51,529][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:51:51,857][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:51:52,182][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:51:52,509][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:51:52,834][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:51:53,161][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:51:53,488][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:51:53,814][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:51:54,140][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:51:54,466][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:51:54,793][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:51:55,120][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:51:55,446][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:51:55,771][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:51:56,098][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:51:56,809][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:51:57,548][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:51:57,549][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:51:57,551][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:51:58,518][__main__][INFO] - Iteration 138 took 21s (34.59% Gen, 60.91% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 9m 45s. Estimated total time: 17h 57m 54s. Time estimates for 10 more iterations: 3m 35s, 100 more iterations: 35m 55s, 500 more iterations: 2h 59m 39s.
[2025-11-13 08:51:58,521][__main__][INFO] - Starting iteration 138.
[2025-11-13 08:51:58,524][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1.
[2025-11-13 08:51:58,524][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:52:06,085][__main__][INFO] - Number of regex retries in iteration 138: 0
[2025-11-13 08:52:06,085][__main__][INFO] - agents played in iteration 138 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:52:06,526][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:52:06,563][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:52:06,596][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:52:06,630][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:52:06,630][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:52:06,631][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:52:07,334][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:52:07,630][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:52:07,957][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:52:08,282][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:52:08,608][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:52:08,936][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:52:09,264][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:52:09,592][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:52:09,919][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:52:10,246][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:52:10,572][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:52:10,901][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:52:11,231][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:52:11,559][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:52:11,886][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:52:12,213][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:52:12,538][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:52:12,864][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:52:13,189][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:52:13,517][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:52:13,845][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:52:14,170][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:52:14,496][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:52:14,822][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:52:15,152][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:52:15,480][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:52:15,809][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:52:16,136][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:52:16,466][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:52:16,794][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:52:17,120][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:52:17,447][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:52:17,776][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:52:18,484][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:52:19,211][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:52:19,213][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:52:19,214][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:52:20,158][__main__][INFO] - Iteration 139 took 21s (34.95% Gen, 60.69% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 13m 13s. Estimated total time: 18h 1m 44s. Time estimates for 10 more iterations: 3m 36s, 100 more iterations: 36m 3s, 500 more iterations: 3h 0m 17s.
[2025-11-13 08:52:20,160][__main__][INFO] - Starting iteration 139.
[2025-11-13 08:52:20,162][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1.
[2025-11-13 08:52:20,163][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:52:27,687][__main__][INFO] - Number of regex retries in iteration 139: 0
[2025-11-13 08:52:27,688][__main__][INFO] - agents played in iteration 139 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:52:28,127][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:52:28,163][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:52:28,197][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:52:28,231][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:52:28,232][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:52:28,232][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:52:28,955][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:52:29,251][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:52:29,579][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:52:29,904][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:52:30,234][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:52:30,558][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:52:30,887][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:52:31,215][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:52:31,546][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:52:31,878][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:52:32,207][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:52:32,533][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:52:32,862][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:52:33,193][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:52:33,522][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:52:33,850][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:52:34,175][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:52:34,500][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:52:34,826][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:52:35,152][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:52:35,480][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:52:35,806][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:52:36,134][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:52:36,460][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:52:36,787][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:52:37,112][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:52:37,439][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:52:37,766][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:52:38,092][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:52:38,418][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:52:38,744][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:52:39,069][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:52:39,396][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:52:40,116][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:52:40,863][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:52:40,865][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:52:40,866][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:52:41,913][__main__][INFO] - Iteration 140 took 21s (34.59% Gen, 60.59% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 18m 41s. Estimated total time: 18h 7m 34s. Time estimates for 10 more iterations: 3m 37s, 100 more iterations: 36m 15s, 500 more iterations: 3h 1m 15s.
[2025-11-13 08:52:41,915][__main__][INFO] - Starting iteration 140.
[2025-11-13 08:52:41,917][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1.
[2025-11-13 08:52:41,918][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:52:49,477][__main__][INFO] - Number of regex retries in iteration 140: 0
[2025-11-13 08:52:49,478][__main__][INFO] - agents played in iteration 140 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:52:49,919][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:52:49,952][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:52:49,986][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:52:50,020][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:52:50,021][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:52:50,021][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:52:50,722][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:52:51,018][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:52:51,345][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:52:51,672][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:52:51,998][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:52:52,324][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:52:52,649][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:52:52,980][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:52:53,308][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:52:53,639][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:52:53,969][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:52:54,293][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:52:54,619][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:52:54,946][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:52:55,272][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:52:55,604][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:52:55,931][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:52:56,262][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:52:56,587][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:52:56,918][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:52:57,247][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:52:57,573][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:52:57,899][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:52:58,224][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:52:58,555][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:52:58,880][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:52:59,206][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:52:59,531][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:52:59,859][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:53:00,185][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:53:00,510][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:53:00,836][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:53:01,163][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:53:01,875][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:53:02,608][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:53:02,609][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:53:02,611][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:53:04,901][__main__][INFO] - Iteration 141 took 22s (32.89% Gen, 57.14% Train). Generation: 7s, Training: 13s. Estimated remaining time: 18h 19m 58s. Estimated total time: 19h 9m 14s. Time estimates for 10 more iterations: 3m 49s, 100 more iterations: 38m 18s, 500 more iterations: 3h 11m 32s.
[2025-11-13 08:53:04,903][__main__][INFO] - Starting iteration 141.
[2025-11-13 08:53:04,906][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1.
[2025-11-13 08:53:04,906][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:53:12,631][__main__][INFO] - Number of regex retries in iteration 141: 0
[2025-11-13 08:53:12,631][__main__][INFO] - agents played in iteration 141 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:53:13,078][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:53:13,116][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:53:13,150][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:53:13,184][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:53:13,185][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:53:13,186][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:53:13,853][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:53:14,150][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:53:14,477][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:53:14,802][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:53:15,134][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:53:15,461][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:53:15,788][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:53:16,114][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:53:16,443][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:53:16,768][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:53:17,096][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:53:17,422][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:53:17,749][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:53:18,076][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:53:18,401][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:53:18,726][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:53:19,054][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:53:19,381][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:53:19,707][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:53:20,033][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:53:20,358][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:53:20,683][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:53:21,010][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:53:21,336][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:53:21,662][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:53:21,987][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:53:22,314][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:53:22,639][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:53:22,965][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:53:23,293][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:53:23,619][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:53:23,946][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:53:24,273][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:53:24,987][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:53:25,720][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:53:25,721][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:53:25,723][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:53:26,659][__main__][INFO] - Iteration 142 took 21s (35.51% Gen, 60.18% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 18m 5s. Estimated total time: 18h 7m 42s. Time estimates for 10 more iterations: 3m 37s, 100 more iterations: 36m 15s, 500 more iterations: 3h 1m 17s.
[2025-11-13 08:53:26,662][__main__][INFO] - Starting iteration 142.
[2025-11-13 08:53:26,664][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1.
[2025-11-13 08:53:26,665][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:53:34,282][__main__][INFO] - Number of regex retries in iteration 142: 0
[2025-11-13 08:53:34,282][__main__][INFO] - agents played in iteration 142 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:53:34,731][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:53:34,764][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:53:34,798][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:53:34,832][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:53:34,833][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:53:34,833][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:53:35,533][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:53:35,828][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:53:36,155][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:53:36,480][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:53:36,806][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:53:37,132][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:53:37,458][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:53:37,783][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:53:38,110][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:53:38,436][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:53:38,763][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:53:39,091][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:53:39,419][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:53:39,748][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:53:40,074][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:53:40,404][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:53:40,732][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:53:41,057][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:53:41,385][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:53:41,710][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:53:42,036][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:53:42,363][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:53:42,689][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:53:43,016][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:53:43,341][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:53:43,667][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:53:43,994][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:53:44,320][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:53:44,647][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:53:44,974][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:53:45,302][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:53:45,631][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:53:45,959][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:53:46,680][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:53:47,405][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:53:47,406][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:53:47,407][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:53:48,336][__main__][INFO] - Iteration 143 took 21s (35.15% Gen, 60.56% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 13m 37s. Estimated total time: 18h 3m 36s. Time estimates for 10 more iterations: 3m 36s, 100 more iterations: 36m 7s, 500 more iterations: 3h 0m 36s.
[2025-11-13 08:53:48,338][__main__][INFO] - Starting iteration 143.
[2025-11-13 08:53:48,340][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1.
[2025-11-13 08:53:48,341][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:53:56,078][__main__][INFO] - Number of regex retries in iteration 143: 0
[2025-11-13 08:53:56,078][__main__][INFO] - agents played in iteration 143 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:53:56,524][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:53:56,560][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:53:56,595][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:53:56,629][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:53:56,630][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:53:56,630][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:53:57,334][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:53:57,630][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:53:57,956][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:53:58,282][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:53:58,608][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:53:58,937][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:53:59,262][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:53:59,587][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:53:59,915][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:54:00,241][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:54:00,566][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:54:00,892][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:54:01,217][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:54:01,545][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:54:01,873][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:54:02,205][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:54:02,535][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:54:02,864][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:54:03,195][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:54:03,527][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:54:03,853][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:54:04,180][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:54:04,509][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:54:04,837][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:54:05,163][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:54:05,490][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:54:05,817][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:54:06,146][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:54:06,472][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:54:06,799][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:54:07,126][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:54:07,453][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:54:07,780][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:54:08,497][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:54:09,226][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:54:09,228][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:54:09,230][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:54:10,205][__main__][INFO] - Iteration 144 took 21s (35.39% Gen, 60.15% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 22m 54s. Estimated total time: 18h 13m 15s. Time estimates for 10 more iterations: 3m 38s, 100 more iterations: 36m 26s, 500 more iterations: 3h 2m 12s.
[2025-11-13 08:54:10,207][__main__][INFO] - Starting iteration 144.
[2025-11-13 08:54:10,210][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1.
[2025-11-13 08:54:10,210][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:54:17,904][__main__][INFO] - Number of regex retries in iteration 144: 0
[2025-11-13 08:54:17,905][__main__][INFO] - agents played in iteration 144 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:54:18,345][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:54:18,379][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:54:18,413][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:54:18,447][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:54:18,447][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:54:18,448][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:54:19,183][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:54:19,479][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:54:19,806][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:54:20,132][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:54:20,459][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:54:20,785][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:54:21,111][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:54:21,437][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:54:21,763][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:54:22,089][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:54:22,415][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:54:22,743][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:54:23,071][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:54:23,396][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:54:23,721][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:54:24,047][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:54:24,374][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:54:24,700][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:54:25,025][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:54:25,350][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:54:25,676][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:54:26,004][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:54:26,331][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:54:26,656][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:54:26,981][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:54:27,307][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:54:27,635][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:54:27,961][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:54:28,287][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:54:28,612][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:54:28,938][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:54:29,267][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:54:29,597][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:54:30,329][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:54:31,086][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:54:31,087][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:54:31,089][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:54:32,054][__main__][INFO] - Iteration 145 took 21s (35.22% Gen, 60.35% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 21m 34s. Estimated total time: 18h 12m 17s. Time estimates for 10 more iterations: 3m 38s, 100 more iterations: 36m 24s, 500 more iterations: 3h 2m 2s.
[2025-11-13 08:54:32,056][__main__][INFO] - Starting iteration 145.
[2025-11-13 08:54:32,059][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1.
[2025-11-13 08:54:32,059][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:54:39,826][__main__][INFO] - Number of regex retries in iteration 145: 0
[2025-11-13 08:54:39,827][__main__][INFO] - agents played in iteration 145 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:54:40,260][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:54:40,295][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:54:40,330][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:54:40,364][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:54:40,365][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:54:40,365][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:54:41,074][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:54:41,370][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:54:41,696][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:54:42,022][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:54:42,347][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:54:42,672][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:54:42,997][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:54:43,323][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:54:43,649][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:54:43,975][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:54:44,303][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:54:44,628][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:54:44,953][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:54:45,280][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:54:45,605][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:54:45,931][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:54:46,256][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:54:46,583][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:54:46,908][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:54:47,234][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:54:47,561][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:54:47,887][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:54:48,213][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:54:48,538][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:54:48,863][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:54:49,190][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:54:49,514][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:54:49,840][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:54:50,166][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:54:50,493][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:54:50,819][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:54:51,145][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:54:51,477][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:54:52,198][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:54:52,933][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:54:52,935][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:54:52,937][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:54:53,967][__main__][INFO] - Iteration 146 took 21s (35.45% Gen, 59.84% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 24m 23s. Estimated total time: 18h 15m 27s. Time estimates for 10 more iterations: 3m 39s, 100 more iterations: 36m 30s, 500 more iterations: 3h 2m 34s.
[2025-11-13 08:54:53,970][__main__][INFO] - Starting iteration 146.
[2025-11-13 08:54:53,972][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1.
[2025-11-13 08:54:53,973][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:54:58,402][mllm.models.large_language_model_local][WARNING] - Response did not match regex: (|), retry 1/1
[2025-11-13 08:55:01,915][__main__][INFO] - Number of regex retries in iteration 146: 1
[2025-11-13 08:55:01,916][__main__][INFO] - agents played in iteration 146 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:55:02,384][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:55:02,418][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:55:02,452][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:55:02,485][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:55:02,486][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:55:02,486][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:55:03,182][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:55:03,476][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:55:03,803][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:55:04,129][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:55:04,457][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:55:04,781][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:55:05,108][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:55:05,434][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:55:05,760][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:55:06,085][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:55:06,410][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:55:06,738][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:55:07,063][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:55:07,390][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:55:07,717][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:55:08,042][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:55:08,368][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:55:08,694][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:55:09,020][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:55:09,347][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:55:09,674][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:55:10,000][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:55:10,325][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:55:10,653][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:55:10,979][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:55:11,309][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:55:11,635][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:55:11,960][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:55:12,286][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:55:12,615][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:55:12,942][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:55:13,269][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:55:13,600][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:55:14,315][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:55:15,042][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:55:15,044][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:55:15,046][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:55:15,971][__main__][INFO] - Iteration 147 took 21s (36.10% Gen, 59.68% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 28m 33s. Estimated total time: 18h 19m 59s. Time estimates for 10 more iterations: 3m 39s, 100 more iterations: 36m 39s, 500 more iterations: 3h 3m 19s.
[2025-11-13 08:55:15,973][__main__][INFO] - Starting iteration 147.
[2025-11-13 08:55:15,976][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1.
[2025-11-13 08:55:15,976][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:55:23,693][__main__][INFO] - Number of regex retries in iteration 147: 0
[2025-11-13 08:55:23,694][__main__][INFO] - agents played in iteration 147 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:55:24,152][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:55:24,187][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:55:24,221][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:55:24,255][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:55:24,256][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:55:24,256][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:55:24,961][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:55:25,257][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:55:25,583][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:55:25,909][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:55:26,233][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:55:26,560][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:55:26,888][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:55:27,214][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:55:27,539][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:55:27,865][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:55:28,190][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:55:28,515][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:55:28,841][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:55:29,166][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:55:29,494][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:55:29,819][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:55:30,146][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:55:30,471][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:55:30,796][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:55:31,124][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:55:31,450][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:55:31,775][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:55:32,100][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:55:32,427][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:55:32,752][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:55:33,080][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:55:33,405][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:55:33,730][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:55:34,056][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:55:34,383][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:55:34,708][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:55:35,034][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:55:35,362][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:55:36,157][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:55:36,885][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:55:36,886][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:55:36,888][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:55:37,825][__main__][INFO] - Iteration 148 took 21s (35.32% Gen, 60.39% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 20m 40s. Estimated total time: 18h 12m 29s. Time estimates for 10 more iterations: 3m 38s, 100 more iterations: 36m 24s, 500 more iterations: 3h 2m 4s.
[2025-11-13 08:55:37,827][__main__][INFO] - Starting iteration 148.
[2025-11-13 08:55:37,830][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1.
[2025-11-13 08:55:37,830][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:55:45,466][__main__][INFO] - Number of regex retries in iteration 148: 0 [2025-11-13 08:55:45,467][__main__][INFO] - agents played in iteration 148 are Alice_buffer, Alice, Bob, Bob_buffer [2025-11-13 08:55:45,916][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:55:45,950][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:55:45,984][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:55:46,018][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:55:46,019][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:55:46,020][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:55:46,740][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:55:47,038][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:55:47,369][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:55:47,696][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:55:48,020][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:55:48,347][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:55:48,673][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:55:48,999][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:55:49,324][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:55:49,650][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:55:49,977][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:55:50,301][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:55:50,627][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:55:50,952][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:55:51,277][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:55:51,603][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:55:51,928][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:55:52,255][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:55:52,581][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:55:52,906][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:55:53,233][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:55:53,559][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:55:53,884][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:55:54,211][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:55:54,536][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:55:54,862][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:55:55,188][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:55:55,513][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:55:55,842][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:55:56,168][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:55:56,493][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:55:56,819][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:55:57,146][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:55:57,856][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:55:58,589][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:55:58,590][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:55:58,593][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:55:59,535][__main__][INFO] - Iteration 149 took 21s (35.18% Gen, 60.47% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 13m 6s. Estimated total time: 18h 5m 17s. Time estimates for 10 more iterations: 3m 37s, 100 more iterations: 36m 10s, 500 more iterations: 3h 0m 52s.
[2025-11-13 08:55:59,537][__main__][INFO] - Starting iteration 149.
[2025-11-13 08:55:59,540][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1.
[2025-11-13 08:55:59,540][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:56:07,413][__main__][INFO] - Number of regex retries in iteration 149: 0
[2025-11-13 08:56:07,414][__main__][INFO] - agents played in iteration 149 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:56:07,856][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:56:07,889][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:56:07,923][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:56:07,957][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:56:07,958][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:56:07,958][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:56:08,672][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:56:08,968][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:56:09,294][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:56:09,619][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:56:09,944][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:56:10,269][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:56:10,595][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:56:10,920][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:56:11,245][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:56:11,571][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:56:11,896][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:56:12,221][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:56:12,547][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:56:12,874][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:56:13,200][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:56:13,525][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:56:13,849][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:56:14,174][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:56:14,500][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:56:14,825][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:56:15,153][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:56:15,479][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:56:15,804][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:56:16,132][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:56:16,457][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:56:16,783][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:56:17,108][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:56:17,435][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:56:17,764][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:56:18,088][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:56:18,414][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:56:18,740][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:56:19,067][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:56:19,789][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:56:20,531][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:56:20,532][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:56:20,534][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:56:21,487][__main__][INFO] - Iteration 150 took 21s (35.87% Gen, 59.78% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 24m 52s. Estimated total time: 18h 17m 24s. Time estimates for 10 more iterations: 3m 39s, 100 more iterations: 36m 34s, 500 more iterations: 3h 2m 54s.
[2025-11-13 08:56:21,489][__main__][INFO] - Starting iteration 150.
[2025-11-13 08:56:21,492][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1.
[2025-11-13 08:56:21,492][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:56:29,202][__main__][INFO] - Number of regex retries in iteration 150: 0
[2025-11-13 08:56:29,203][__main__][INFO] - agents played in iteration 150 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:56:29,660][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:56:29,693][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:56:29,727][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:56:29,760][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:56:29,760][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:56:29,761][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:56:30,478][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:56:30,773][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:56:31,100][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:56:31,426][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:56:31,752][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:56:32,079][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:56:32,403][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:56:32,729][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:56:33,054][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:56:33,380][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:56:33,706][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:56:34,032][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:56:34,357][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:56:34,683][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:56:35,008][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:56:35,333][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:56:35,659][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:56:35,984][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:56:36,312][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:56:36,638][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:56:36,968][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:56:37,296][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:56:37,625][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:56:37,953][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:56:38,278][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:56:38,605][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:56:38,930][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:56:39,255][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:56:39,581][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:56:39,909][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:56:40,236][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:56:40,564][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:56:40,894][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:56:41,620][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:56:42,364][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:56:42,365][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:56:42,367][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:56:44,271][__main__][INFO] - Iteration 151 took 22s (33.85% Gen, 57.79% Train). Generation: 7s, Training: 13s. Estimated remaining time: 18h 6m 6s. Estimated total time: 18h 59m 1s. Time estimates for 10 more iterations: 3m 47s, 100 more iterations: 37m 58s, 500 more iterations: 3h 9m 50s.
[2025-11-13 08:56:44,273][__main__][INFO] - Starting iteration 151.
[2025-11-13 08:56:44,276][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1.
[2025-11-13 08:56:44,277][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:56:52,453][__main__][INFO] - Number of regex retries in iteration 151: 0
[2025-11-13 08:56:52,453][__main__][INFO] - agents played in iteration 151 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:56:52,897][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:56:52,933][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:56:52,967][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:56:53,001][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:56:53,002][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:56:53,002][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:56:53,724][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:56:54,022][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:56:54,348][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:56:54,675][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:56:55,001][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:56:55,329][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:56:55,654][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:56:55,985][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:56:56,311][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:56:56,636][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:56:56,961][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:56:57,285][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:56:57,610][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:56:57,935][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:56:58,260][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:56:58,587][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:56:58,913][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:56:59,237][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:56:59,563][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:56:59,888][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:57:00,214][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:57:00,541][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:57:00,867][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:57:01,193][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:57:01,518][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:57:01,845][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:57:02,169][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:57:02,495][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:57:02,821][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:57:03,147][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:57:03,472][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:57:03,799][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:57:04,125][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:57:04,837][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:57:05,575][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:57:05,577][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:57:05,579][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:57:06,524][__main__][INFO] - Iteration 152 took 22s (36.75% Gen, 58.99% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 39m 9s. Estimated total time: 18h 32m 27s. Time estimates for 10 more iterations: 3m 42s, 100 more iterations: 37m 4s, 500 more iterations: 3h 5m 24s.
[2025-11-13 08:57:06,527][__main__][INFO] - Starting iteration 152.
[2025-11-13 08:57:06,529][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1.
[2025-11-13 08:57:06,529][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:57:14,501][__main__][INFO] - Number of regex retries in iteration 152: 0
[2025-11-13 08:57:14,502][__main__][INFO] - agents played in iteration 152 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:57:14,943][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:57:14,979][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:57:15,012][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:57:15,046][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:57:15,047][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:57:15,047][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:57:15,766][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:57:16,062][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:57:16,389][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:57:16,716][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:57:17,042][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:57:17,370][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:57:17,695][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:57:18,019][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:57:18,346][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:57:18,671][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:57:18,996][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:57:19,322][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:57:19,648][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:57:19,974][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:57:20,302][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:57:20,627][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:57:20,953][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:57:21,280][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:57:21,607][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:57:21,937][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:57:22,262][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:57:22,588][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:57:22,914][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:57:23,239][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:57:23,564][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:57:23,892][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:57:24,216][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:57:24,541][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:57:24,867][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:57:25,192][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:57:25,519][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:57:25,845][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:57:26,171][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:57:26,889][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:57:27,640][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:57:27,641][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:57:27,643][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:57:28,845][__main__][INFO] - Iteration 153 took 22s (35.72% Gen, 58.88% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 42m 10s. Estimated total time: 18h 35m 50s. Time estimates for 10 more iterations: 3m 43s, 100 more iterations: 37m 11s, 500 more iterations: 3h 5m 58s.
[2025-11-13 08:57:28,847][__main__][INFO] - Starting iteration 153.
[2025-11-13 08:57:28,849][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1.
[2025-11-13 08:57:28,850][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:57:36,862][__main__][INFO] - Number of regex retries in iteration 153: 0
[2025-11-13 08:57:36,863][__main__][INFO] - agents played in iteration 153 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:57:37,306][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:57:37,342][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:57:37,376][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:57:37,410][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:57:37,410][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:57:37,411][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:57:38,138][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:57:38,434][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:57:38,761][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:57:39,086][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:57:39,417][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:57:39,742][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:57:40,069][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:57:40,396][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:57:40,723][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:57:41,050][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:57:41,379][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:57:41,705][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:57:42,035][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:57:42,367][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:57:42,697][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:57:43,023][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:57:43,355][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:57:43,682][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:57:44,013][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:57:44,342][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:57:44,674][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:57:45,002][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:57:45,330][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:57:45,657][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:57:45,985][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:57:46,312][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:57:46,640][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:57:46,970][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:57:47,299][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:57:47,631][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:57:47,960][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:57:48,289][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:57:48,617][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:57:49,342][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:57:50,042][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:57:50,043][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:57:50,044][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:57:50,945][__main__][INFO] - Iteration 154 took 22s (36.26% Gen, 59.66% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 30m 46s. Estimated total time: 18h 24m 48s. Time estimates for 10 more iterations: 3m 40s, 100 more iterations: 36m 49s, 500 more iterations: 3h 4m 8s.
[2025-11-13 08:57:50,947][__main__][INFO] - Starting iteration 154.
[2025-11-13 08:57:50,949][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1.
[2025-11-13 08:57:50,950][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:57:58,829][__main__][INFO] - Number of regex retries in iteration 154: 0
[2025-11-13 08:57:58,829][__main__][INFO] - agents played in iteration 154 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:57:59,273][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:57:59,309][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:57:59,343][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:57:59,377][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:57:59,378][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:57:59,378][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:58:00,093][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:58:00,388][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:58:00,716][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:58:01,047][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:58:01,374][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:58:01,702][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:58:02,028][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:58:02,361][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:58:02,688][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:58:03,016][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:58:03,344][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:58:03,671][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:58:03,997][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:58:04,328][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:58:04,653][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:58:04,979][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:58:05,309][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:58:05,637][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:58:05,962][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:58:06,294][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:58:06,620][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:58:06,946][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:58:07,273][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:58:07,599][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:58:07,927][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:58:08,255][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:58:08,585][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:58:08,911][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:58:09,237][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:58:09,564][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:58:09,892][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:58:10,220][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:58:10,547][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:58:11,278][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:58:11,971][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:58:11,973][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:58:11,974][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:58:12,868][__main__][INFO] - Iteration 155 took 21s (35.95% Gen, 59.97% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 21m 35s. Estimated total time: 18h 15m 58s. Time estimates for 10 more iterations: 3m 39s, 100 more iterations: 36m 31s, 500 more iterations: 3h 2m 39s.
[2025-11-13 08:58:12,870][__main__][INFO] - Starting iteration 155.
[2025-11-13 08:58:12,873][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1.
[2025-11-13 08:58:12,873][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:58:20,732][__main__][INFO] - Number of regex retries in iteration 155: 0
[2025-11-13 08:58:20,733][__main__][INFO] - agents played in iteration 155 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:58:21,176][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:58:21,209][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:58:21,243][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:58:21,276][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:58:21,277][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:58:21,278][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:58:22,013][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:58:22,309][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:58:22,636][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:58:22,967][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:58:23,294][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:58:23,622][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:58:23,947][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:58:24,272][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:58:24,598][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:58:24,923][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:58:25,248][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:58:25,574][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:58:25,899][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:58:26,224][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:58:26,549][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:58:26,874][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:58:27,200][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:58:27,525][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:58:27,852][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:58:28,178][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:58:28,509][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:58:28,835][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:58:29,161][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:58:29,486][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:58:29,811][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:58:30,137][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:58:30,465][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:58:30,793][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:58:31,119][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:58:31,445][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:58:31,771][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:58:32,097][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:58:32,423][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:58:33,149][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:58:33,854][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:58:33,856][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:58:33,857][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:58:34,804][__main__][INFO] - Iteration 156 took 21s (35.83% Gen, 59.84% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 21m 50s. Estimated total time: 18h 16m 35s. Time estimates for 10 more iterations: 3m 39s, 100 more iterations: 36m 33s, 500 more iterations: 3h 2m 45s.
[2025-11-13 08:58:34,806][__main__][INFO] - Starting iteration 156.
[2025-11-13 08:58:34,809][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1.
[2025-11-13 08:58:34,809][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:58:42,616][__main__][INFO] - Number of regex retries in iteration 156: 0
[2025-11-13 08:58:42,617][__main__][INFO] - agents played in iteration 156 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:58:43,067][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:58:43,103][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:58:43,137][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:58:43,171][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:58:43,171][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:58:43,171][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:58:43,886][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:58:44,181][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:58:44,509][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:58:44,835][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:58:45,162][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:58:45,488][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:58:45,815][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:58:46,142][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:58:46,469][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:58:46,794][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:58:47,119][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:58:47,446][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:58:47,772][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:58:48,098][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:58:48,424][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:58:48,750][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:58:49,076][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:58:49,402][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:58:49,727][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:58:50,052][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:58:50,378][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:58:50,704][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:58:51,030][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:58:51,355][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:58:51,683][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:58:52,012][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:58:52,340][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:58:52,672][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:58:53,000][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:58:53,331][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:58:53,660][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:58:53,989][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:58:54,318][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:58:55,038][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:58:55,733][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:58:55,734][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:58:55,736][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:58:56,641][__main__][INFO] - Iteration 157 took 21s (35.76% Gen, 60.09% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 16m 31s. Estimated total time: 18h 11m 38s. Time estimates for 10 more iterations: 3m 38s, 100 more iterations: 36m 23s, 500 more iterations: 3h 1m 56s.
[2025-11-13 08:58:56,643][__main__][INFO] - Starting iteration 157.
[2025-11-13 08:58:56,646][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1.
[2025-11-13 08:58:56,647][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:59:04,198][__main__][INFO] - Number of regex retries in iteration 157: 0
[2025-11-13 08:59:04,199][__main__][INFO] - agents played in iteration 157 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:59:04,641][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:59:04,676][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:59:04,710][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:59:04,744][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:59:04,744][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:59:04,745][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:59:05,453][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:59:05,749][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:59:06,078][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:59:06,406][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:59:06,731][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:59:07,056][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:59:07,381][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:59:07,707][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:59:08,039][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:59:08,366][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:59:08,691][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:59:09,016][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:59:09,342][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:59:09,666][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:59:09,993][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:59:10,318][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:59:10,647][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:59:10,973][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:59:11,298][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:59:11,623][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:59:11,951][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:59:12,277][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:59:12,605][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:59:12,934][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:59:13,259][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:59:13,586][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:59:13,916][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:59:14,248][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:59:14,576][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:59:14,905][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:59:15,233][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:59:15,560][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:59:15,887][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:59:16,616][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:59:17,310][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:59:17,312][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:59:17,313][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:59:18,288][__main__][INFO] - Iteration 158 took 21s (34.89% Gen, 60.60% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 6m 38s. Estimated total time: 18h 2m 7s. Time estimates for 10 more iterations: 3m 36s, 100 more iterations: 36m 4s, 500 more iterations: 3h 0m 21s.
[2025-11-13 08:59:18,290][__main__][INFO] - Starting iteration 158.
[2025-11-13 08:59:18,293][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1.
[2025-11-13 08:59:18,294][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:59:26,013][__main__][INFO] - Number of regex retries in iteration 158: 0
[2025-11-13 08:59:26,013][__main__][INFO] - agents played in iteration 158 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 08:59:26,454][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:59:26,490][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:59:26,524][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:59:26,557][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:59:26,558][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:59:26,558][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:59:27,283][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:59:27,579][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:59:27,905][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:59:28,229][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:59:28,557][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:59:28,882][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:59:29,209][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:59:29,536][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:59:29,863][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:59:30,189][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:59:30,513][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:59:30,840][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:59:31,166][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:59:31,492][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:59:31,817][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:59:32,143][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:59:32,468][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:59:32,793][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:59:33,120][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:59:33,447][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:59:33,773][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:59:34,102][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:59:34,433][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:59:34,761][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:59:35,089][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:59:35,421][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:59:35,753][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:59:36,083][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:59:36,411][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:59:36,737][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:59:37,063][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:59:37,390][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:59:37,717][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:59:38,434][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:59:39,225][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:59:39,227][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:59:39,229][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:59:40,269][__main__][INFO] - Iteration 159 took 21s (35.13% Gen, 60.13% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 22m 59s. Estimated total time: 18h 18m 50s. Time estimates for 10 more iterations: 3m 39s, 100 more iterations: 36m 37s, 500 more iterations: 3h 3m 8s.
[2025-11-13 08:59:40,271][__main__][INFO] - Starting iteration 159.
[2025-11-13 08:59:40,274][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1.
[2025-11-13 08:59:40,274][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:59:48,029][__main__][INFO] - Number of regex retries in iteration 159: 0 [2025-11-13 08:59:48,030][__main__][INFO] - agents played in iteration 159 are Alice_buffer, Alice, Bob, Bob_buffer [2025-11-13 08:59:48,470][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:59:48,507][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:59:48,541][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:59:48,575][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:59:48,576][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:59:48,576][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
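The per-iteration summary entries above ("Iteration … took 21s … Estimated remaining time: 17h 22m 59s …") can be reproduced with straightforward arithmetic: average iteration time times iterations left. A minimal sketch, not the project's actual code; the function names, the total-iteration count, and the "Xh Ym Zs" formatting convention are assumptions inferred from the log:

```python
def format_duration(seconds: float) -> str:
    """Render a duration as 'Xh Ym Zs', dropping leading zero units
    (assumption: matches strings like '17h 22m 59s' and '3m 39s' in the log)."""
    seconds = int(seconds)
    h, rem = divmod(seconds, 3600)
    m, s = divmod(rem, 60)
    if h:
        return f"{h}h {m}m {s}s"
    if m:
        return f"{m}m {s}s"
    return f"{s}s"

def eta_report(done_iters: int, total_iters: int, avg_iter_seconds: float) -> str:
    """Estimate remaining and total wall time from the mean iteration duration."""
    remaining = total_iters - done_iters
    return (f"Estimated remaining time: {format_duration(remaining * avg_iter_seconds)}. "
            f"Estimated total time: {format_duration(total_iters * avg_iter_seconds)}.")
```

For example, with ~22 s per iteration and a few thousand iterations total, `eta_report` yields estimates on the same 17-18 hour scale seen in the log.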
[2025-11-13 08:59:49,273][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:59:49,568][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:59:49,894][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:59:50,227][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:59:50,556][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:59:50,882][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:59:51,211][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:59:51,542][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:59:51,872][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:59:52,200][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:59:52,532][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:59:52,863][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:59:53,194][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:59:53,522][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:59:53,848][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:59:54,177][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:59:54,505][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:59:54,835][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:59:55,164][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:59:55,491][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:59:55,820][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:59:56,152][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:59:56,484][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:59:56,811][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:59:57,137][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:59:57,463][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:59:57,788][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:59:58,115][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:59:58,441][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:59:58,767][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:59:59,095][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:59:59,421][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:59:59,749][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:00:00,468][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:00:01,200][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:00:01,202][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:00:01,204][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:00:02,179][__main__][INFO] - Iteration 160 took 21s (35.40% Gen, 60.14% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 19m 3s. Estimated total time: 18h 15m 16s. Time estimates for 10 more iterations: 3m 39s, 100 more iterations: 36m 30s, 500 more iterations: 3h 2m 32s.
[2025-11-13 09:00:02,181][__main__][INFO] - Starting iteration 160.
[2025-11-13 09:00:02,183][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1.
[2025-11-13 09:00:02,183][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:00:09,700][__main__][INFO] - Number of regex retries in iteration 160: 0
[2025-11-13 09:00:09,701][__main__][INFO] - agents played in iteration 160 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:00:10,154][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:00:10,188][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:00:10,221][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:00:10,254][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:00:10,255][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:00:10,255][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
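The "For task: …, ΔVRAM % (total): …, Current % of VRAM taken: …, Block Peak % of device VRAM: …, ΔTime: …" lines track GPU memory and wall time around each task. A hedged sketch of how such a line can be rendered; this is not the mllm implementation, and in real code the byte counts would presumably come from queries like `torch.cuda.memory_allocated()` and `torch.cuda.max_memory_allocated()` (assumption):

```python
import time

def vram_report(task: str, before: int, after: int, peak: int,
                total: int, dt_seconds: float) -> str:
    """Render one memory-tracking log line in the style seen above.
    before/after: memory in use at block entry/exit; peak: block peak;
    total: device capacity (all in the same unit, e.g. bytes)."""
    pct = lambda x: 100.0 * x / total  # fraction of device capacity
    dt = time.strftime("%H:%M:%S", time.gmtime(dt_seconds))  # 'HH:MM:SS' delta
    return (f"For task: {task}, ΔVRAM % (total): {pct(after - before):.2f}%, "
            f"Current % of VRAM taken: {pct(after):.2f}%, "
            f"Block Peak % of device VRAM: {pct(peak):.2f}%, ΔTime: {dt}")
```

Wrapping a task between a peak-stats reset, a before/after query pair, and this formatter reproduces the pattern where cheap tasks log 0.00% delta and the reinforce step logs a small positive delta.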
[2025-11-13 09:00:10,942][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:00:11,238][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:00:11,568][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:00:11,898][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:00:12,231][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:00:12,562][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:00:12,892][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:00:13,220][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:00:13,548][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:00:13,876][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:00:14,204][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:00:14,532][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:00:14,863][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:00:15,191][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:00:15,523][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:00:15,855][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:00:16,183][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:00:16,514][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:00:16,842][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:00:17,174][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:00:17,509][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:00:17,835][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:00:18,166][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:00:18,495][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:00:18,822][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:00:19,149][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:00:19,478][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:00:19,810][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:00:20,138][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:00:20,464][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:00:20,791][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:00:21,119][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:00:21,448][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
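Each training phase above walks 128 mini-batches, accumulates a policy gradient loss over all action tokens (3840 here), and only then applies one "reinforce step". A minimal sketch of that accumulation pattern, with illustrative names and a plain-Python REINFORCE surrogate rather than the project's actual trainer API (assumption throughout):

```python
def accumulate_policy_loss(mini_batches):
    """mini_batches: list of mini-batches, each a list of
    (log_prob, advantage) pairs, one pair per action token.
    Returns (token-mean REINFORCE loss, total token count); gradients
    would be accumulated across mini-batches before a single optimizer step."""
    total_loss = 0.0
    total_tokens = 0
    for i, mb in enumerate(mini_batches):
        if i % 4 == 0:  # the log reports progress every 4 mini-batches
            print(f"Processing mini-batch {i} of {len(mini_batches)}")
        # REINFORCE surrogate: -log pi(a|s) * advantage, summed over tokens
        total_loss += sum(-lp * adv for lp, adv in mb)
        total_tokens += len(mb)
    print(f"Accumulated the policy gradient loss for {total_tokens} tokens.")
    return total_loss / max(total_tokens, 1), total_tokens
```

With 128 mini-batches of 30 action tokens each, this reproduces the 3840-token totals in the log; in an autograd framework the per-mini-batch `backward()` calls would accumulate gradients until the step is applied.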
[2025-11-13 09:00:22,170][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:00:22,902][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:00:22,904][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:00:22,906][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:00:24,882][__main__][INFO] - Iteration 161 took 22s (33.11% Gen, 58.17% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 58m 23s. Estimated total time: 18h 54m 59s. Time estimates for 10 more iterations: 3m 46s, 100 more iterations: 37m 49s, 500 more iterations: 3h 9m 9s.
[2025-11-13 09:00:24,884][__main__][INFO] - Starting iteration 161.
[2025-11-13 09:00:24,886][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 16 and human policies 1.
[2025-11-13 09:00:24,887][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:00:32,861][__main__][INFO] - Number of regex retries in iteration 161: 0
[2025-11-13 09:00:32,862][__main__][INFO] - agents played in iteration 161 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:00:33,311][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:00:33,344][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:00:33,378][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:00:33,412][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:00:33,412][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:00:33,413][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:00:34,105][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:00:34,400][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:00:34,726][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:00:35,053][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:00:35,378][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:00:35,704][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:00:36,033][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:00:36,361][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:00:36,689][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:00:37,013][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:00:37,342][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:00:37,669][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:00:37,995][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:00:38,323][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:00:38,649][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:00:38,977][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:00:39,306][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:00:39,633][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:00:39,962][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:00:40,288][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:00:40,614][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:00:40,940][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:00:41,267][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:00:41,594][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:00:41,922][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:00:42,249][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:00:42,576][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:00:42,902][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:00:43,228][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:00:43,556][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:00:43,883][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:00:44,212][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:00:44,539][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:00:45,250][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:00:45,965][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:00:45,966][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:00:45,968][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:00:46,913][__main__][INFO] - Iteration 162 took 22s (36.20% Gen, 59.50% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 24m 25s. Estimated total time: 18h 21m 23s. Time estimates for 10 more iterations: 3m 40s, 100 more iterations: 36m 42s, 500 more iterations: 3h 3m 33s.
[2025-11-13 09:00:46,915][__main__][INFO] - Starting iteration 162.
[2025-11-13 09:00:46,918][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 16 and human policies 1.
[2025-11-13 09:00:46,919][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:00:54,378][__main__][INFO] - Number of regex retries in iteration 162: 0
[2025-11-13 09:00:54,378][__main__][INFO] - agents played in iteration 162 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:00:54,823][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:00:54,859][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:00:54,893][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:00:54,927][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:00:54,927][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:00:54,928][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:00:55,670][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:00:55,967][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:00:56,294][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:00:56,620][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:00:56,947][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:00:57,273][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:00:57,600][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:00:57,925][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:00:58,252][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:00:58,578][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:00:58,905][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:00:59,232][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:00:59,559][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:00:59,885][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:01:00,212][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:01:00,538][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:01:00,866][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:01:01,196][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:01:01,525][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:01:01,851][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:01:02,177][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:01:02,509][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:01:02,834][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:01:03,163][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:01:03,492][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:01:03,818][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:01:04,145][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:01:04,477][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:01:04,805][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:01:05,133][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:01:05,459][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:01:05,786][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:01:06,118][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:01:06,830][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:01:07,561][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:01:07,563][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:01:07,565][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:01:08,622][__main__][INFO] - Iteration 163 took 21s (34.37% Gen, 60.75% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 7m 54s. Estimated total time: 18h 5m 13s. Time estimates for 10 more iterations: 3m 37s, 100 more iterations: 36m 10s, 500 more iterations: 3h 0m 52s.
[2025-11-13 09:01:08,624][__main__][INFO] - Starting iteration 163.
[2025-11-13 09:01:08,627][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 16 and human policies 1.
[2025-11-13 09:01:08,627][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:01:16,564][__main__][INFO] - Number of regex retries in iteration 163: 0
[2025-11-13 09:01:16,565][__main__][INFO] - agents played in iteration 163 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:01:17,017][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:01:17,050][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:01:17,084][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:01:17,118][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:01:17,118][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:01:17,118][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:01:17,842][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:01:18,137][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:01:18,465][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:01:18,795][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:01:19,120][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:01:19,448][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:01:19,778][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:01:20,107][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:01:20,434][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:01:20,759][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:01:21,086][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:01:21,414][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:01:21,743][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:01:22,071][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:01:22,399][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:01:22,727][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:01:23,055][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:01:23,380][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:01:23,707][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:01:24,033][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:01:24,359][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:01:24,686][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:01:25,012][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:01:25,340][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:01:25,670][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:01:25,997][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:01:26,324][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:01:26,651][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:01:26,978][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:01:27,306][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:01:27,631][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:01:27,958][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:01:28,284][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:01:29,001][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:01:29,731][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:01:29,732][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:01:29,734][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:01:30,944][__main__][INFO] - Iteration 164 took 22s (35.57% Gen, 59.01% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 38m 13s. Estimated total time: 18h 35m 55s. Time estimates for 10 more iterations: 3m 43s, 100 more iterations: 37m 11s, 500 more iterations: 3h 5m 59s.
[2025-11-13 09:01:30,946][__main__][INFO] - Starting iteration 164.
[2025-11-13 09:01:30,949][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 16 and human policies 1.
[2025-11-13 09:01:30,950][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:01:38,810][__main__][INFO] - Number of regex retries in iteration 164: 0
[2025-11-13 09:01:38,811][__main__][INFO] - agents played in iteration 164 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:01:39,263][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:01:39,297][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:01:39,331][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:01:39,365][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:01:39,366][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:01:39,366][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:01:40,075][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:01:40,371][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:01:40,698][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:01:41,023][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:01:41,350][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:01:41,678][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:01:42,006][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:01:42,338][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:01:42,666][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:01:42,995][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:01:43,323][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:01:43,649][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:01:43,977][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:01:44,304][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:01:44,632][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:01:44,958][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:01:45,285][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:01:45,611][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:01:45,937][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:01:46,264][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:01:46,589][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:01:46,915][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:01:47,243][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:01:47,568][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:01:47,893][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:01:48,219][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:01:48,546][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:01:48,871][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:01:49,197][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:01:49,525][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:01:49,854][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:01:50,182][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:01:50,509][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
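The run above walks 128 mini-batches before a single optimizer step and reports the total token count (3840, i.e. 30 tokens per mini-batch). A minimal sketch of that gradient-accumulation pattern, with hypothetical names (`accumulate_policy_loss` and the batch layout are illustrative, not this trainer's actual API):

```python
def accumulate_policy_loss(mini_batches):
    """Sum per-token REINFORCE losses across mini-batches.

    Each mini-batch is a list of (token_logprob, advantage) pairs.
    Returns (mean loss over all tokens, total token count) -- mirroring
    the log's "Accumulated the policy gradient loss for N tokens" line.
    """
    total_loss, n_tokens = 0.0, 0
    for batch in mini_batches:
        for logprob, adv in batch:
            total_loss += -logprob * adv   # REINFORCE: -log pi(a|s) * A
            n_tokens += 1
    return total_loss / max(n_tokens, 1), n_tokens

# 128 mini-batches of 30 tokens each, as in the log (numbers are toy values)
batches = [[(0.5, 1.0)] * 30 for _ in range(128)]
loss, tokens = accumulate_policy_loss(batches)
print(tokens)  # 3840
```

In the real trainer each mini-batch would call `backward()` to accumulate gradients before the "Apply reinforce step" task runs the optimizer; this sketch only shows the bookkeeping.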
[2025-11-13 09:01:51,229][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:01:51,952][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:01:51,953][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:01:51,955][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:01:52,897][__main__][INFO] - Iteration 165 took 21s (35.81% Gen, 59.88% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 19m 24s. Estimated total time: 18h 17m 28s. Time estimates for 10 more iterations: 3m 39s, 100 more iterations: 36m 34s, 500 more iterations: 3h 2m 54s.
[2025-11-13 09:01:52,899][__main__][INFO] - Starting iteration 165.
[2025-11-13 09:01:52,902][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 16 and human policies 1.
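The timing summaries above extrapolate one measured iteration duration out to 10, 100, and 500 future iterations. A rough sketch of that arithmetic, under assumptions (the estimator itself is not shown in the log, and `fmt` is a hypothetical helper; the real code likely averages over recent iterations rather than using a single one):

```python
def fmt(seconds):
    """Render a duration in the log's '3m 39s' / '3h 2m 54s' style."""
    s = int(round(seconds))
    h, rem = divmod(s, 3600)
    m, s = divmod(rem, 60)
    return f"{h}h {m}m {s}s" if h else f"{m}m {s}s"

# Approximate per-iteration wall time inferred from the log's own
# estimates (e.g. "100 more iterations: 35m 19s" => ~21.19 s each).
iter_seconds = 21.19
for n in (10, 100, 500):
    print(f"{n} more iterations: {fmt(n * iter_seconds)}")
```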
[2025-11-13 09:01:52,902][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:01:59,988][__main__][INFO] - Number of regex retries in iteration 165: 0
[2025-11-13 09:01:59,988][__main__][INFO] - agents played in iteration 165 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:02:00,432][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:02:00,466][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:02:00,500][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:02:00,533][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:02:00,534][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:02:00,534][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:02:01,246][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:02:01,544][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:02:01,872][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:02:02,199][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:02:02,526][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:02:02,853][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:02:03,185][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:02:03,515][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:02:03,842][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:02:04,168][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:02:04,496][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:02:04,823][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:02:05,149][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:02:05,480][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:02:05,807][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:02:06,133][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:02:06,460][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:02:06,789][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:02:07,116][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:02:07,443][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:02:07,771][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:02:08,095][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:02:08,422][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:02:08,748][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:02:09,074][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:02:09,400][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:02:09,727][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:02:10,053][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:02:10,383][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:02:10,709][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:02:11,038][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:02:11,365][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:02:11,692][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:02:12,401][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:02:13,126][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:02:13,127][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:02:13,129][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:02:14,095][__main__][INFO] - Iteration 166 took 21s (33.43% Gen, 62.00% Train). Generation: 7s, Training: 13s. Estimated remaining time: 16h 41m 16s. Estimated total time: 17h 39m 41s. Time estimates for 10 more iterations: 3m 31s, 100 more iterations: 35m 19s, 500 more iterations: 2h 56m 36s.
[2025-11-13 09:02:14,097][__main__][INFO] - Starting iteration 166.
[2025-11-13 09:02:14,101][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 16 and human policies 1.
[2025-11-13 09:02:14,101][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:02:21,714][__main__][INFO] - Number of regex retries in iteration 166: 0
[2025-11-13 09:02:21,714][__main__][INFO] - agents played in iteration 166 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:02:22,139][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:02:22,173][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:02:22,207][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:02:22,241][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:02:22,241][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:02:22,242][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:02:22,963][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:02:23,260][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:02:23,588][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:02:23,915][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:02:24,245][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:02:24,573][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:02:24,901][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:02:25,232][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:02:25,558][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:02:25,884][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:02:26,210][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:02:26,537][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:02:26,864][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:02:27,193][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:02:27,520][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:02:27,849][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:02:28,176][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:02:28,503][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:02:28,829][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:02:29,155][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:02:29,481][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:02:29,808][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:02:30,136][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:02:30,462][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:02:30,787][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:02:31,113][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:02:31,440][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:02:31,770][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:02:32,096][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:02:32,422][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:02:32,749][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:02:33,080][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:02:33,408][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:02:34,083][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:02:34,799][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:02:34,800][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:02:34,802][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:02:35,759][__main__][INFO] - Iteration 167 took 21s (35.15% Gen, 60.43% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 4m 10s. Estimated total time: 18h 2m 57s. Time estimates for 10 more iterations: 3m 36s, 100 more iterations: 36m 5s, 500 more iterations: 3h 0m 29s.
[2025-11-13 09:02:35,761][__main__][INFO] - Starting iteration 167.
[2025-11-13 09:02:35,764][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 16 and human policies 1.
[2025-11-13 09:02:35,767][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:02:43,771][__main__][INFO] - Number of regex retries in iteration 167: 0
[2025-11-13 09:02:43,771][__main__][INFO] - agents played in iteration 167 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:02:44,226][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:02:44,261][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:02:44,295][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:02:44,329][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:02:44,330][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:02:44,330][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:02:45,061][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:02:45,359][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:02:45,688][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:02:46,015][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:02:46,343][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:02:46,671][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:02:46,998][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:02:47,324][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:02:47,651][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:02:47,977][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:02:48,303][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:02:48,628][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:02:48,955][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:02:49,281][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:02:49,607][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:02:49,934][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:02:50,261][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:02:50,589][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:02:50,915][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:02:51,242][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:02:51,567][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:02:51,895][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:02:52,220][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:02:52,545][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:02:52,870][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:02:53,196][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:02:53,524][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:02:53,856][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:02:54,183][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:02:54,511][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:02:54,839][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:02:55,167][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:02:55,497][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:02:56,196][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:02:56,921][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:02:56,923][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:02:56,925][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:02:58,116][__main__][INFO] - Iteration 168 took 22s (35.81% Gen, 58.84% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 38m 29s. Estimated total time: 18h 37m 38s. Time estimates for 10 more iterations: 3m 43s, 100 more iterations: 37m 15s, 500 more iterations: 3h 6m 16s.
[2025-11-13 09:02:58,118][__main__][INFO] - Starting iteration 168.
[2025-11-13 09:02:58,122][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 16 and human policies 1.
[2025-11-13 09:02:58,123][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:03:06,144][__main__][INFO] - Number of regex retries in iteration 168: 0
[2025-11-13 09:03:06,144][__main__][INFO] - agents played in iteration 168 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:03:06,589][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:03:06,624][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:03:06,657][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:03:06,691][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:03:06,692][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:03:06,692][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:03:07,407][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:03:07,705][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:03:08,035][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:03:08,361][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:03:08,688][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:03:09,014][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:03:09,342][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:03:09,669][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:03:09,994][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:03:10,321][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:03:10,646][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:03:10,975][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:03:11,304][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:03:11,632][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:03:11,957][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:03:12,285][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:03:12,612][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:03:12,943][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:03:13,275][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:03:13,604][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:03:13,934][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:03:14,262][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:03:14,588][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:03:14,915][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:03:15,242][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:03:15,570][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:03:15,896][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:03:16,224][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:03:16,549][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:03:16,876][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:03:17,201][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:03:17,528][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:03:17,853][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:03:18,575][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:03:19,298][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:03:19,299][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:03:19,301][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:03:20,281][__main__][INFO] - Iteration 169 took 22s (36.20% Gen, 59.37% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 28m 28s. Estimated total time: 18h 27m 59s. Time estimates for 10 more iterations: 3m 41s, 100 more iterations: 36m 55s, 500 more iterations: 3h 4m 39s.
[2025-11-13 09:03:20,283][__main__][INFO] - Starting iteration 169.
[2025-11-13 09:03:20,287][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 16 and human policies 1.
[2025-11-13 09:03:20,287][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:03:28,269][__main__][INFO] - Number of regex retries in iteration 169: 0
[2025-11-13 09:03:28,270][__main__][INFO] - agents played in iteration 169 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:03:28,729][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:03:28,763][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:03:28,797][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:03:28,831][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:03:28,832][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:03:28,832][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:03:29,577][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:03:29,874][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:03:30,200][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:03:30,526][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:03:30,853][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:03:31,179][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:03:31,507][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:03:31,833][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:03:32,161][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:03:32,489][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:03:32,817][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:03:33,143][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:03:33,471][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:03:33,798][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:03:34,124][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:03:34,454][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:03:34,781][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:03:35,107][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:03:35,436][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:03:35,764][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:03:36,090][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:03:36,418][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:03:36,744][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:03:37,071][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:03:37,396][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:03:37,724][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:03:38,055][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:03:38,383][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:03:38,709][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:03:39,034][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:03:39,360][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:03:39,687][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:03:40,014][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:03:40,728][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:03:41,452][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:03:41,453][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:03:41,455][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:03:42,551][__main__][INFO] - Iteration 170 took 22s (35.85% Gen, 59.22% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 33m 23s. Estimated total time: 18h 33m 16s. Time estimates for 10 more iterations: 3m 42s, 100 more iterations: 37m 6s, 500 more iterations: 3h 5m 32s.
[2025-11-13 09:03:42,554][__main__][INFO] - Starting iteration 170.
[2025-11-13 09:03:42,557][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 16 and human policies 1.
[2025-11-13 09:03:42,557][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:03:49,910][__main__][INFO] - Number of regex retries in iteration 170: 0 [2025-11-13 09:03:49,911][__main__][INFO] - agents played in iteration 170 are Alice_buffer, Alice, Bob, Bob_buffer [2025-11-13 09:03:50,353][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:03:50,387][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:03:50,421][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:03:50,454][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:03:50,455][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:03:50,455][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:03:51,189][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:03:51,488][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:03:51,815][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:03:52,144][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:03:52,471][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:03:52,797][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:03:53,124][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:03:53,451][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:03:53,778][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:03:54,104][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:03:54,431][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:03:54,757][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:03:55,084][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:03:55,409][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:03:55,737][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:03:56,063][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:03:56,391][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:03:56,717][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:03:57,044][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:03:57,370][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:03:57,696][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:03:58,023][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:03:58,350][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:03:58,676][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:03:59,003][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:03:59,331][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:03:59,660][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:03:59,987][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:04:00,316][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:04:00,647][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:04:00,976][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:04:01,306][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:04:01,632][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:04:02,317][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:04:03,034][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:04:03,035][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:04:03,037][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:04:04,969][__main__][INFO] - Iteration 171 took 22s (32.81% Gen, 58.57% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 40m 24s. Estimated total time: 18h 40m 40s. Time estimates for 10 more iterations: 3m 44s, 100 more iterations: 37m 21s, 500 more iterations: 3h 6m 46s.
[2025-11-13 09:04:04,971][__main__][INFO] - Starting iteration 171.
[2025-11-13 09:04:04,975][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 17 and human policies 1.
[2025-11-13 09:04:04,975][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:04:13,142][__main__][INFO] - Number of regex retries in iteration 171: 0
[2025-11-13 09:04:13,143][__main__][INFO] - agents played in iteration 171 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:04:13,585][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:04:13,619][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:04:13,653][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:04:13,687][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:04:13,688][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:04:13,688][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:04:14,415][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:04:14,713][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:04:15,041][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:04:15,369][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:04:15,695][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:04:16,023][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:04:16,351][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:04:16,677][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:04:17,006][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:04:17,332][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:04:17,661][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:04:17,986][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:04:18,313][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:04:18,644][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:04:18,973][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:04:19,302][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:04:19,629][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:04:19,957][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:04:20,286][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:04:20,614][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:04:20,942][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:04:21,273][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:04:21,601][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:04:21,930][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:04:22,254][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:04:22,580][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:04:22,906][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:04:23,232][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:04:23,560][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:04:23,888][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:04:24,216][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:04:24,542][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:04:24,870][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:04:25,561][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:04:26,293][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:04:26,294][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:04:26,296][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:04:27,263][__main__][INFO] - Iteration 172 took 22s (36.64% Gen, 59.01% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 33m 50s. Estimated total time: 18h 34m 28s. Time estimates for 10 more iterations: 3m 42s, 100 more iterations: 37m 8s, 500 more iterations: 3h 5m 44s.
[2025-11-13 09:04:27,265][__main__][INFO] - Starting iteration 172.
[2025-11-13 09:04:27,269][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 17 and human policies 1.
[2025-11-13 09:04:27,269][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:04:35,363][__main__][INFO] - Number of regex retries in iteration 172: 0
[2025-11-13 09:04:35,363][__main__][INFO] - agents played in iteration 172 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:04:35,814][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:04:35,848][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:04:35,882][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:04:35,917][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:04:35,918][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:04:35,918][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:04:36,642][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:04:36,940][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:04:37,267][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:04:37,592][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:04:37,917][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:04:38,244][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:04:38,569][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:04:38,895][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:04:39,223][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:04:39,553][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:04:39,885][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:04:40,212][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:04:40,542][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:04:40,872][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:04:41,199][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:04:41,532][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:04:41,864][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:04:42,192][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:04:42,517][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:04:42,848][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:04:43,176][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:04:43,503][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:04:43,833][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:04:44,161][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:04:44,489][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:04:44,816][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:04:45,147][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:04:45,478][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:04:45,805][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:04:46,133][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:04:46,460][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:04:46,788][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:04:47,115][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:04:47,850][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:04:48,595][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:04:48,597][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:04:48,599][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:04:49,878][__main__][INFO] - Iteration 173 took 22s (35.80% Gen, 58.54% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 49m 28s. Estimated total time: 18h 50m 29s. Time estimates for 10 more iterations: 3m 46s, 100 more iterations: 37m 40s, 500 more iterations: 3h 8m 24s.
[2025-11-13 09:04:49,880][__main__][INFO] - Starting iteration 173.
[2025-11-13 09:04:49,882][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 17 and human policies 1.
[2025-11-13 09:04:49,883][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:04:57,681][__main__][INFO] - Number of regex retries in iteration 173: 0
[2025-11-13 09:04:57,682][__main__][INFO] - agents played in iteration 173 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:04:58,129][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:04:58,162][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:04:58,195][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:04:58,229][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:04:58,230][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:04:58,230][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:04:58,972][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:04:59,267][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:04:59,594][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:04:59,920][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:05:00,246][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:05:00,571][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:05:00,895][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:05:01,221][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:05:01,549][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:05:01,874][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:05:02,200][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:05:02,526][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:05:02,852][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:05:03,178][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:05:03,503][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:05:03,828][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:05:04,153][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:05:04,478][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:05:04,805][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:05:05,130][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:05:05,455][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:05:05,781][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:05:06,109][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:05:06,435][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:05:06,760][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:05:07,085][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:05:07,410][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:05:07,736][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:05:08,061][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:05:08,387][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:05:08,712][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:05:09,038][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:05:09,364][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:05:10,082][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:05:10,814][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:05:10,815][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:05:10,817][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:05:11,779][__main__][INFO] - Iteration 174 took 21s (35.62% Gen, 59.98% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 13m 30s. Estimated total time: 18h 14m 52s. Time estimates for 10 more iterations: 3m 38s, 100 more iterations: 36m 29s, 500 more iterations: 3h 2m 28s.
[2025-11-13 09:05:11,781][__main__][INFO] - Starting iteration 174.
[2025-11-13 09:05:11,785][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 17 and human policies 1.
[2025-11-13 09:05:11,785][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:05:19,700][__main__][INFO] - Number of regex retries in iteration 174: 0
[2025-11-13 09:05:19,701][__main__][INFO] - agents played in iteration 174 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:05:20,143][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:05:20,177][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:05:20,210][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:05:20,243][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:05:20,244][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:05:20,244][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:05:20,989][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:05:21,284][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:05:21,611][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:05:21,937][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:05:22,263][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:05:22,591][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:05:22,918][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:05:23,245][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:05:23,571][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:05:23,898][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:05:24,226][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:05:24,552][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:05:24,882][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:05:25,210][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:05:25,537][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:05:25,863][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:05:26,188][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:05:26,513][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:05:26,840][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:05:27,165][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:05:27,490][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:05:27,817][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:05:28,143][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:05:28,470][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:05:28,796][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:05:29,121][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:05:29,449][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:05:29,775][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:05:30,102][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:05:30,428][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:05:30,754][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:05:31,080][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:05:31,407][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:05:32,139][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:05:32,883][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:05:32,885][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:05:32,887][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:05:33,900][__main__][INFO] - Iteration 175 took 22s (35.79% Gen, 59.62% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 24m 4s. Estimated total time: 18h 25m 49s. Time estimates for 10 more iterations: 3m 41s, 100 more iterations: 36m 51s, 500 more iterations: 3h 4m 18s.
[2025-11-13 09:05:33,902][__main__][INFO] - Starting iteration 175.
[2025-11-13 09:05:33,905][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 17 and human policies 1.
[2025-11-13 09:05:33,906][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:05:42,091][__main__][INFO] - Number of regex retries in iteration 175: 0
[2025-11-13 09:05:42,091][__main__][INFO] - agents played in iteration 175 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:05:42,554][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:05:42,588][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:05:42,622][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:05:42,656][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:05:42,657][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:05:42,658][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:05:43,378][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:05:43,674][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:05:44,004][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:05:44,331][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:05:44,657][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:05:44,985][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:05:45,312][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:05:45,641][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:05:45,967][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:05:46,295][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:05:46,623][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:05:46,950][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:05:47,274][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:05:47,599][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:05:47,927][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:05:48,256][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:05:48,581][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:05:48,909][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:05:49,240][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:05:49,567][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:05:49,893][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:05:50,220][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:05:50,545][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:05:50,871][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:05:51,197][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:05:51,524][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:05:51,849][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:05:52,178][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:05:52,508][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:05:52,834][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:05:53,160][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:05:53,487][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:05:53,814][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:05:54,528][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:05:55,249][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:05:55,250][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:05:55,252][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:05:56,299][__main__][INFO] - Iteration 176 took 22s (36.55% Gen, 58.77% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 37m 35s. Estimated total time: 18h 39m 42s. Time estimates for 10 more iterations: 3m 43s, 100 more iterations: 37m 19s, 500 more iterations: 3h 6m 37s.
[2025-11-13 09:05:56,301][__main__][INFO] - Starting iteration 176.
[2025-11-13 09:05:56,304][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 17 and human policies 1.
[2025-11-13 09:05:56,304][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:06:04,662][__main__][INFO] - Number of regex retries in iteration 176: 0
[2025-11-13 09:06:04,663][__main__][INFO] - agents played in iteration 176 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:06:05,113][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:06:05,149][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:06:05,183][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:06:05,217][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:06:05,218][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:06:05,218][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:06:05,937][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:06:06,233][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:06:06,565][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:06:06,891][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:06:07,217][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:06:07,546][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:06:07,874][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:06:08,200][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:06:08,526][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:06:08,853][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:06:09,177][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:06:09,504][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:06:09,830][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:06:10,155][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:06:10,480][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:06:10,806][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:06:11,132][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:06:11,459][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:06:11,785][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:06:12,111][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:06:12,436][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:06:12,763][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:06:13,088][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:06:13,413][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:06:13,739][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:06:14,066][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:06:14,393][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:06:14,718][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:06:15,043][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:06:15,375][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:06:15,702][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:06:16,027][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:06:16,353][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:06:17,062][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:06:17,792][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:06:17,793][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:06:17,795][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:06:18,770][__main__][INFO] - Iteration 177 took 22s (37.20% Gen, 58.45% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 40m 52s. Estimated total time: 18h 43m 21s. Time estimates for 10 more iterations: 3m 44s, 100 more iterations: 37m 26s, 500 more iterations: 3h 7m 13s.
[2025-11-13 09:06:18,772][__main__][INFO] - Starting iteration 177.
[2025-11-13 09:06:18,775][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 17 and human policies 1.
[2025-11-13 09:06:18,776][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:06:26,793][__main__][INFO] - Number of regex retries in iteration 177: 0
[2025-11-13 09:06:26,793][__main__][INFO] - agents played in iteration 177 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:06:27,237][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:06:27,274][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:06:27,308][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:06:27,341][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:06:27,342][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:06:27,342][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:06:28,089][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:06:28,386][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:06:28,713][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:06:29,040][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:06:29,368][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:06:29,694][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:06:30,021][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:06:30,347][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:06:30,676][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:06:31,003][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:06:31,330][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:06:31,660][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:06:31,987][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:06:32,311][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:06:32,636][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:06:32,962][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:06:33,289][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:06:33,614][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:06:33,940][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:06:34,267][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:06:34,593][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:06:34,919][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:06:35,247][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:06:35,571][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:06:35,898][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:06:36,223][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:06:36,550][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:06:36,878][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:06:37,204][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:06:37,530][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:06:37,857][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:06:38,184][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:06:38,510][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:06:39,242][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:06:39,985][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:06:39,986][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:06:39,987][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:06:40,954][__main__][INFO] - Iteration 178 took 22s (36.15% Gen, 59.49% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 26m 6s. Estimated total time: 18h 28m 58s. Time estimates for 10 more iterations: 3m 41s, 100 more iterations: 36m 57s, 500 more iterations: 3h 4m 49s.
[2025-11-13 09:06:40,956][__main__][INFO] - Starting iteration 178.
[2025-11-13 09:06:40,959][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 17 and human policies 1.
[2025-11-13 09:06:40,960][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:06:49,552][__main__][INFO] - Number of regex retries in iteration 178: 0
[2025-11-13 09:06:49,553][__main__][INFO] - agents played in iteration 178 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:06:49,996][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:06:50,030][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:06:50,064][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:06:50,098][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:06:50,099][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:06:50,099][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:06:50,828][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:06:51,128][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:06:51,456][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:06:51,784][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:06:52,109][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:06:52,434][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:06:52,761][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:06:53,087][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:06:53,413][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:06:53,739][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:06:54,063][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:06:54,390][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:06:54,715][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:06:55,041][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:06:55,367][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:06:55,692][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:06:56,020][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:06:56,346][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:06:56,677][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:06:57,002][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:06:57,327][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:06:57,658][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:06:57,985][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:06:58,318][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:06:58,643][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:06:58,969][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:06:59,296][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:06:59,627][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:06:59,952][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:07:00,277][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:07:00,605][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:07:00,931][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:07:01,264][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:07:01,972][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:07:02,713][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:07:02,714][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:07:02,716][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:07:03,692][__main__][INFO] - Iteration 179 took 22s (37.80% Gen, 57.90% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 53m 27s. Estimated total time: 18h 56m 41s. Time estimates for 10 more iterations: 3m 47s, 100 more iterations: 37m 53s, 500 more iterations: 3h 9m 26s.
[2025-11-13 09:07:03,695][__main__][INFO] - Starting iteration 179.
[2025-11-13 09:07:03,698][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 17 and human policies 1.
[2025-11-13 09:07:03,698][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:07:12,275][__main__][INFO] - Number of regex retries in iteration 179: 0
[2025-11-13 09:07:12,276][__main__][INFO] - agents played in iteration 179 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:07:12,721][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:07:12,755][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:07:12,789][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:07:12,823][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:07:12,824][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:07:12,824][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:07:13,547][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:07:13,844][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:07:14,173][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:07:14,501][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:07:14,827][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:07:15,155][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:07:15,484][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:07:15,808][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:07:16,135][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:07:16,463][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:07:16,787][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:07:17,113][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:07:17,438][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:07:17,764][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:07:18,090][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:07:18,417][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:07:18,743][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:07:19,068][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:07:19,394][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:07:19,719][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:07:20,046][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:07:20,371][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:07:20,696][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:07:21,023][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:07:21,347][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:07:21,673][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:07:21,999][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:07:22,324][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:07:22,651][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:07:22,976][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:07:23,302][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:07:23,630][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:07:23,957][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:07:24,651][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:07:25,374][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:07:25,376][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:07:25,377][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:07:26,356][__main__][INFO] - Iteration 180 took 22s (37.85% Gen, 57.82% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 49m 19s. Estimated total time: 18h 52m 56s. Time estimates for 10 more iterations: 3m 46s, 100 more iterations: 37m 45s, 500 more iterations: 3h 8m 49s.
[2025-11-13 09:07:26,358][__main__][INFO] - Starting iteration 180.
[2025-11-13 09:07:26,362][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 17 and human policies 1.
[2025-11-13 09:07:26,363][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:07:34,892][__main__][INFO] - Number of regex retries in iteration 180: 0
[2025-11-13 09:07:34,892][__main__][INFO] - agents played in iteration 180 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:07:35,351][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:07:35,388][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:07:35,423][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:07:35,457][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:07:35,458][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:07:35,458][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:07:36,169][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:07:36,467][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:07:36,795][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:07:37,123][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:07:37,449][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:07:37,779][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:07:38,105][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:07:38,433][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:07:38,760][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:07:39,086][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:07:39,412][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:07:39,738][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:07:40,065][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:07:40,391][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:07:40,717][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:07:41,042][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:07:41,369][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:07:41,694][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:07:42,021][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:07:42,348][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:07:42,675][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:07:43,001][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:07:43,327][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:07:43,655][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:07:43,979][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:07:44,305][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:07:44,631][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:07:44,957][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:07:45,286][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:07:45,613][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:07:45,941][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:07:46,267][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:07:46,592][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:07:47,295][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:07:48,023][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:07:48,024][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:07:48,026][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:07:50,043][__main__][INFO] - Iteration 181 took 23s (36.02% Gen, 55.46% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 40m 6s. Estimated total time: 19h 44m 7s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 28s, 500 more iterations: 3h 17m 21s.
[2025-11-13 09:07:50,045][__main__][INFO] - Starting iteration 181.
[2025-11-13 09:07:50,049][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 18 and human policies 1.
[2025-11-13 09:07:50,049][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:07:58,902][__main__][INFO] - Number of regex retries in iteration 181: 0
[2025-11-13 09:07:58,903][__main__][INFO] - agents played in iteration 181 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:07:59,366][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:07:59,400][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:07:59,434][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:07:59,469][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:07:59,470][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:07:59,471][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:08:00,201][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:08:00,499][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:08:00,827][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:08:01,154][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:08:01,479][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:08:01,806][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:08:02,135][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:08:02,466][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:08:02,794][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:08:03,122][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:08:03,448][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:08:03,774][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:08:04,099][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:08:04,431][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:08:04,757][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:08:05,083][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:08:05,412][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:08:05,739][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:08:06,067][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:08:06,392][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:08:06,723][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:08:07,055][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:08:07,383][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:08:07,714][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:08:08,047][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:08:08,376][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:08:08,705][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:08:09,033][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:08:09,361][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:08:09,689][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:08:10,019][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:08:10,352][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:08:10,683][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:08:11,416][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:08:12,168][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:08:12,169][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:08:12,171][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:08:13,173][__main__][INFO] - Iteration 182 took 23s (38.28% Gen, 57.38% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 11m 51s. Estimated total time: 19h 16m 15s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 32s, 500 more iterations: 3h 12m 42s.
[2025-11-13 09:08:13,176][__main__][INFO] - Starting iteration 182.
[2025-11-13 09:08:13,179][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 18 and human policies 1.
[2025-11-13 09:08:13,180][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:08:22,163][__main__][INFO] - Number of regex retries in iteration 182: 0
[2025-11-13 09:08:22,164][__main__][INFO] - agents played in iteration 182 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:08:22,614][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:08:22,648][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:08:22,682][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:08:22,716][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:08:22,717][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:08:22,717][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:08:23,439][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:08:23,735][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:08:24,062][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:08:24,388][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:08:24,714][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:08:25,041][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:08:25,365][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:08:25,691][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:08:26,017][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:08:26,344][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:08:26,670][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:08:26,995][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:08:27,322][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:08:27,648][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:08:27,977][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:08:28,303][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:08:28,631][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:08:28,956][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:08:29,281][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:08:29,607][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:08:29,932][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:08:30,258][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:08:30,583][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:08:30,908][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:08:31,234][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:08:31,559][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:08:31,885][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:08:32,213][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:08:32,538][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:08:32,863][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:08:33,188][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:08:33,515][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:08:33,843][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:08:34,561][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:08:35,300][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:08:35,302][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:08:35,303][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:08:36,248][__main__][INFO] - Iteration 183 took 23s (38.94% Gen, 56.96% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 8m 42s. Estimated total time: 19h 13m 30s. Time estimates for 10 more iterations: 3m 50s, 100 more iterations: 38m 27s, 500 more iterations: 3h 12m 15s.
[2025-11-13 09:08:36,250][__main__][INFO] - Starting iteration 183.
[2025-11-13 09:08:36,253][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 18 and human policies 1.
[2025-11-13 09:08:36,254][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:08:44,839][__main__][INFO] - Number of regex retries in iteration 183: 0
[2025-11-13 09:08:44,840][__main__][INFO] - agents played in iteration 183 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:08:45,284][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:08:45,322][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:08:45,356][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:08:45,390][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:08:45,391][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:08:45,391][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:08:46,114][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:08:46,410][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:08:46,740][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:08:47,066][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:08:47,391][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:08:47,717][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:08:48,041][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:08:48,367][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:08:48,692][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:08:49,019][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:08:49,345][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:08:49,669][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:08:49,994][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:08:50,320][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:08:50,645][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:08:50,971][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:08:51,296][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:08:51,621][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:08:51,947][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:08:52,272][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:08:52,598][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:08:52,924][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:08:53,250][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:08:53,577][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:08:53,902][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:08:54,228][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:08:54,554][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:08:54,880][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:08:55,207][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:08:55,532][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:08:55,859][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:08:56,184][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:08:56,510][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:08:57,227][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:08:57,959][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:08:57,961][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:08:57,962][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:08:58,913][__main__][INFO] - Iteration 184 took 22s (37.89% Gen, 57.91% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 47m 53s. Estimated total time: 18h 53m 2s. Time estimates for 10 more iterations: 3m 46s, 100 more iterations: 37m 46s, 500 more iterations: 3h 8m 50s.
[2025-11-13 09:08:58,922][__main__][INFO] - Starting iteration 184.
[2025-11-13 09:08:58,925][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 18 and human policies 1.
[2025-11-13 09:08:58,925][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:09:07,880][__main__][INFO] - Number of regex retries in iteration 184: 0
[2025-11-13 09:09:07,880][__main__][INFO] - agents played in iteration 184 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:09:08,341][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:09:08,377][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:09:08,411][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:09:08,445][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:09:08,446][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:09:08,446][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:09:09,175][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:09:09,474][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:09:09,805][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:09:10,136][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:09:10,465][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:09:10,791][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:09:11,117][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:09:11,444][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:09:11,772][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:09:12,103][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:09:12,431][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:09:12,756][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:09:13,087][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:09:13,412][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:09:13,738][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:09:14,064][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:09:14,388][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:09:14,715][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:09:15,041][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:09:15,367][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:09:15,693][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:09:16,024][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:09:16,353][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:09:16,683][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:09:17,009][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:09:17,334][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:09:17,659][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:09:17,990][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:09:18,317][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:09:18,644][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:09:18,972][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:09:19,297][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:09:19,625][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:09:20,335][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:09:21,067][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:09:21,069][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:09:21,070][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:09:22,113][__main__][INFO] - Iteration 185 took 23s (38.62% Gen, 56.88% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 13m 55s. Estimated total time: 19h 19m 27s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 38s, 500 more iterations: 3h 13m 14s.
[2025-11-13 09:09:22,116][__main__][INFO] - Starting iteration 185.
[2025-11-13 09:09:22,119][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 18 and human policies 1.
[2025-11-13 09:09:22,119][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:09:30,166][__main__][INFO] - Number of regex retries in iteration 185: 0
[2025-11-13 09:09:30,166][__main__][INFO] - agents played in iteration 185 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:09:30,628][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:09:30,664][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:09:30,698][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:09:30,731][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:09:30,732][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:09:30,733][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:09:31,476][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:09:31,774][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:09:32,100][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:09:32,426][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:09:32,754][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:09:33,079][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:09:33,409][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:09:33,736][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:09:34,066][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:09:34,393][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:09:34,722][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:09:35,053][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:09:35,385][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:09:35,714][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:09:36,045][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:09:36,372][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:09:36,698][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:09:37,023][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:09:37,349][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:09:37,676][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:09:38,001][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:09:38,330][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:09:38,655][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:09:38,980][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:09:39,309][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:09:39,634][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:09:39,960][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:09:40,286][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:09:40,612][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:09:40,937][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:09:41,265][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:09:41,589][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:09:41,916][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:09:42,623][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:09:43,364][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:09:43,366][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:09:43,367][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:09:44,439][__main__][INFO] - Iteration 186 took 22s (36.05% Gen, 59.15% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 30m 7s. Estimated total time: 18h 36m 2s. Time estimates for 10 more iterations: 3m 43s, 100 more iterations: 37m 12s, 500 more iterations: 3h 6m 0s.
[2025-11-13 09:09:44,451][__main__][INFO] - Starting iteration 186.
[2025-11-13 09:09:44,454][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 18 and human policies 1.
[2025-11-13 09:09:44,454][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:09:53,308][__main__][INFO] - Number of regex retries in iteration 186: 0
[2025-11-13 09:09:53,309][__main__][INFO] - agents played in iteration 186 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:09:53,758][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:09:53,793][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:09:53,828][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:09:53,862][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:09:53,863][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:09:53,863][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:09:54,597][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:09:54,895][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:09:55,221][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:09:55,547][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:09:55,874][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:09:56,200][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:09:56,527][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:09:56,854][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:09:57,180][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:09:57,505][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:09:57,831][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:09:58,156][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:09:58,482][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:09:58,807][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:09:59,133][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:09:59,458][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:09:59,785][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:10:00,112][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:10:00,437][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:10:00,763][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:10:01,088][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:10:01,414][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:10:01,739][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:10:02,066][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:10:02,391][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:10:02,718][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:10:03,043][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:10:03,372][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:10:03,698][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:10:04,023][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:10:04,349][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:10:04,675][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:10:05,000][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:10:05,729][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:10:06,481][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:10:06,482][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:10:06,484][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:10:07,479][__main__][INFO] - Iteration 187 took 23s (38.45% Gen, 57.22% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 5m 0s. Estimated total time: 19h 11m 18s. Time estimates for 10 more iterations: 3m 50s, 100 more iterations: 38m 22s, 500 more iterations: 3h 11m 53s.
[2025-11-13 09:10:07,481][__main__][INFO] - Starting iteration 187.
[2025-11-13 09:10:07,484][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 18 and human policies 1.
[2025-11-13 09:10:07,485][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:10:15,963][__main__][INFO] - Number of regex retries in iteration 187: 0
[2025-11-13 09:10:15,963][__main__][INFO] - agents played in iteration 187 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:10:16,411][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:10:16,445][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:10:16,478][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:10:16,512][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:10:16,513][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:10:16,513][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:10:17,257][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:10:17,555][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:10:17,884][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:10:18,210][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:10:18,536][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:10:18,863][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:10:19,188][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:10:19,515][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:10:19,841][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:10:20,168][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:10:20,494][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:10:20,820][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:10:21,149][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:10:21,475][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:10:21,806][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:10:22,134][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:10:22,461][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:10:22,789][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:10:23,115][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:10:23,440][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:10:23,768][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:10:24,100][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:10:24,431][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:10:24,760][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:10:25,085][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:10:25,411][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:10:25,739][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:10:26,065][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:10:26,391][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:10:26,717][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:10:27,045][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:10:27,371][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:10:27,698][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:10:28,404][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:10:29,154][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:10:29,155][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:10:29,157][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:10:30,165][__main__][INFO] - Iteration 188 took 22s (37.38% Gen, 58.17% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 47m 23s. Estimated total time: 18h 54m 4s. Time estimates for 10 more iterations: 3m 46s, 100 more iterations: 37m 48s, 500 more iterations: 3h 9m 0s.
[2025-11-13 09:10:30,167][__main__][INFO] - Starting iteration 188.
[2025-11-13 09:10:30,171][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 18 and human policies 1.
[2025-11-13 09:10:30,171][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:10:39,218][__main__][INFO] - Number of regex retries in iteration 188: 0
[2025-11-13 09:10:39,219][__main__][INFO] - agents played in iteration 188 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:10:39,659][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:10:39,693][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:10:39,727][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:10:39,762][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:10:39,763][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:10:39,763][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:10:40,486][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:10:40,783][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:10:41,109][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:10:41,434][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:10:41,761][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:10:42,089][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:10:42,419][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:10:42,745][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:10:43,073][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:10:43,401][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:10:43,728][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:10:44,056][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:10:44,384][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:10:44,710][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:10:45,036][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:10:45,365][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:10:45,692][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:10:46,019][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:10:46,345][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:10:46,671][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:10:46,999][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:10:47,327][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:10:47,652][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:10:47,978][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:10:48,304][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:10:48,631][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:10:48,957][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:10:49,283][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:10:49,609][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:10:49,936][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:10:50,261][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:10:50,587][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:10:50,913][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:10:51,621][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:10:52,368][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:10:52,369][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:10:52,371][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:10:53,451][__main__][INFO] - Iteration 189 took 23s (38.86% Gen, 56.49% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 17m 0s. Estimated total time: 19h 24m 5s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 48s, 500 more iterations: 3h 14m 0s.
[2025-11-13 09:10:53,453][__main__][INFO] - Starting iteration 189.
[2025-11-13 09:10:53,457][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 18 and human policies 1.
[2025-11-13 09:10:53,457][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:11:02,502][__main__][INFO] - Number of regex retries in iteration 189: 0
[2025-11-13 09:11:02,502][__main__][INFO] - agents played in iteration 189 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:11:02,965][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:11:02,999][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:11:03,033][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:11:03,067][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:11:03,067][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:11:03,067][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:11:03,791][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:11:04,088][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:11:04,416][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:11:04,742][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:11:05,067][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:11:05,393][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:11:05,718][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:11:06,045][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:11:06,373][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:11:06,699][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:11:07,028][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:11:07,355][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:11:07,682][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:11:08,007][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:11:08,334][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:11:08,660][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:11:08,986][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:11:09,313][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:11:09,638][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:11:09,964][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:11:10,291][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:11:10,618][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:11:10,943][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:11:11,269][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:11:11,596][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:11:11,921][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:11:12,249][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:11:12,574][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:11:12,898][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:11:13,223][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:11:13,549][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:11:13,875][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:11:14,201][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:11:14,908][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:11:15,687][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:11:15,689][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:11:15,691][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:11:16,671][__main__][INFO] - Iteration 190 took 23s (38.96% Gen, 56.81% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 13m 19s. Estimated total time: 19h 20m 46s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 41s, 500 more iterations: 3h 13m 27s.
[2025-11-13 09:11:16,674][__main__][INFO] - Starting iteration 190.
[2025-11-13 09:11:16,677][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 18 and human policies 1.
[2025-11-13 09:11:16,677][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:11:25,568][__main__][INFO] - Number of regex retries in iteration 190: 0
[2025-11-13 09:11:25,569][__main__][INFO] - agents played in iteration 190 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:11:26,016][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:11:26,052][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:11:26,088][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:11:26,122][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:11:26,122][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:11:26,123][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:11:26,859][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:11:27,162][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:11:27,489][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:11:27,819][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:11:28,150][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:11:28,477][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:11:28,805][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:11:29,136][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:11:29,463][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:11:29,788][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:11:30,115][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:11:30,442][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:11:30,769][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:11:31,095][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:11:31,423][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:11:31,751][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:11:32,077][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:11:32,403][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:11:32,731][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:11:33,058][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:11:33,386][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:11:33,712][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:11:34,037][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:11:34,365][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:11:34,690][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:11:35,016][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:11:35,342][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:11:35,670][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:11:35,997][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:11:36,329][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:11:36,659][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:11:36,987][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:11:37,315][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:11:38,032][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:11:38,780][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:11:38,782][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:11:38,783][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:11:40,819][__main__][INFO] - Iteration 191 took 24s (36.83% Gen, 54.73% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 59m 18s. Estimated total time: 20h 7m 9s. Time estimates for 10 more iterations: 4m 1s, 100 more iterations: 40m 14s, 500 more iterations: 3h 21m 11s.
[2025-11-13 09:11:40,821][__main__][INFO] - Starting iteration 191.
[2025-11-13 09:11:40,824][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 19 and human policies 1.
[2025-11-13 09:11:40,824][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:11:50,097][__main__][INFO] - Number of regex retries in iteration 191: 0
[2025-11-13 09:11:50,098][__main__][INFO] - agents played in iteration 191 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:11:50,537][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:11:50,572][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:11:50,606][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:11:50,639][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:11:50,640][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:11:50,641][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:11:51,344][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:11:51,641][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:11:51,966][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:11:52,292][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:11:52,618][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:11:52,943][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:11:53,269][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:11:53,595][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:11:53,921][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:11:54,247][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:11:54,574][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:11:54,901][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:11:55,227][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:11:55,553][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:11:55,879][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:11:56,206][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:11:56,535][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:11:56,863][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:11:57,190][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:11:57,518][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:11:57,847][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:11:58,175][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:11:58,502][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:11:58,828][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:11:59,154][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:11:59,481][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:11:59,807][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:12:00,133][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:12:00,461][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:12:00,786][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:12:01,114][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:12:01,440][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:12:01,768][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:12:02,510][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:12:03,262][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:12:03,263][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:12:03,266][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:12:04,157][__main__][INFO] - Iteration 192 took 23s (39.74% Gen, 56.43% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 18m 26s. Estimated total time: 19h 26m 41s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 53s, 500 more iterations: 3h 14m 26s.
[2025-11-13 09:12:04,159][__main__][INFO] - Starting iteration 192.
[2025-11-13 09:12:04,161][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 19 and human policies 1.
[2025-11-13 09:12:04,162][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:12:13,278][__main__][INFO] - Number of regex retries in iteration 192: 0 [2025-11-13 09:12:13,279][__main__][INFO] - agents played in iteration 192 are Alice_buffer, Alice, Bob, Bob_buffer [2025-11-13 09:12:13,716][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:12:13,751][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:12:13,785][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:12:13,819][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:12:13,820][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:12:13,820][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:12:14,516][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:12:14,811][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:12:15,139][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:12:15,466][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:12:15,793][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:12:16,120][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:12:16,446][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:12:16,773][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:12:17,099][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:12:17,425][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:12:17,754][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:12:18,082][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:12:18,410][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:12:18,737][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:12:19,063][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:12:19,391][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:12:19,718][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:12:20,045][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:12:20,371][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:12:20,697][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:12:21,022][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:12:21,353][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:12:21,684][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:12:22,010][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:12:22,336][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:12:22,662][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:12:22,988][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:12:23,317][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:12:23,642][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:12:23,968][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:12:24,295][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:12:24,622][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:12:24,951][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:12:25,667][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:12:26,418][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:12:26,419][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:12:26,421][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:12:27,309][__main__][INFO] - Iteration 193 took 23s (39.39% Gen, 56.77% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 8m 45s. Estimated total time: 19h 17m 24s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 34s, 500 more iterations: 3h 12m 54s.
[2025-11-13 09:12:27,310][__main__][INFO] - Starting iteration 193.
[2025-11-13 09:12:27,313][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 19 and human policies 1.
[2025-11-13 09:12:27,314][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:12:36,064][__main__][INFO] - Number of regex retries in iteration 193: 0
[2025-11-13 09:12:36,065][__main__][INFO] - agents played in iteration 193 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:12:36,521][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:12:36,557][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:12:36,591][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:12:36,625][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:12:36,625][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:12:36,626][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:12:37,303][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:12:37,599][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:12:37,930][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:12:38,257][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:12:38,585][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:12:38,912][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:12:39,241][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:12:39,566][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:12:39,895][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:12:40,222][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:12:40,550][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:12:40,877][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:12:41,209][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:12:41,540][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:12:41,869][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:12:42,195][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:12:42,520][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:12:42,849][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:12:43,175][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:12:43,500][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:12:43,826][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:12:44,155][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:12:44,487][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:12:44,813][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:12:45,138][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:12:45,463][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:12:45,790][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:12:46,116][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:12:46,441][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:12:46,769][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:12:47,095][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:12:47,421][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:12:47,748][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:12:48,470][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:12:49,202][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:12:49,203][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:12:49,205][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:12:50,107][__main__][INFO] - Iteration 194 took 22s (38.39% Gen, 57.65% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 50m 41s. Estimated total time: 18h 59m 42s. Time estimates for 10 more iterations: 3m 47s, 100 more iterations: 37m 59s, 500 more iterations: 3h 9m 57s.
[2025-11-13 09:12:50,109][__main__][INFO] - Starting iteration 194.
[2025-11-13 09:12:50,111][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 19 and human policies 1.
[2025-11-13 09:12:50,112][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:12:58,850][__main__][INFO] - Number of regex retries in iteration 194: 0
[2025-11-13 09:12:58,850][__main__][INFO] - agents played in iteration 194 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:12:59,295][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:12:59,329][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:12:59,362][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:12:59,396][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:12:59,396][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:12:59,397][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:13:00,126][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:13:00,421][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:13:00,748][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:13:01,075][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:13:01,402][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:13:01,727][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:13:02,054][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:13:02,379][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:13:02,705][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:13:03,033][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:13:03,361][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:13:03,688][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:13:04,014][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:13:04,340][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:13:04,670][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:13:04,997][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:13:05,322][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:13:05,648][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:13:05,976][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:13:06,303][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:13:06,633][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:13:06,960][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:13:07,286][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:13:07,611][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:13:07,939][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:13:08,270][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:13:08,596][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:13:08,927][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:13:09,254][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:13:09,581][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:13:09,907][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:13:10,234][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:13:10,562][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:13:11,277][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:13:12,017][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:13:12,019][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:13:12,020][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:13:13,000][__main__][INFO] - Iteration 195 took 22s (38.18% Gen, 57.54% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 55m 3s. Estimated total time: 19h 4m 27s. Time estimates for 10 more iterations: 3m 48s, 100 more iterations: 38m 8s, 500 more iterations: 3h 10m 44s.
[2025-11-13 09:13:13,002][__main__][INFO] - Starting iteration 195.
[2025-11-13 09:13:13,005][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 19 and human policies 1.
[2025-11-13 09:13:13,005][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:13:21,695][__main__][INFO] - Number of regex retries in iteration 195: 0
[2025-11-13 09:13:21,696][__main__][INFO] - agents played in iteration 195 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:13:22,145][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:13:22,179][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:13:22,213][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:13:22,247][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:13:22,247][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:13:22,247][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:13:22,963][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:13:23,261][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:13:23,587][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:13:23,915][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:13:24,247][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:13:24,574][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:13:24,900][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:13:25,228][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:13:25,556][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:13:25,883][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:13:26,210][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:13:26,537][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:13:26,862][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:13:27,188][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:13:27,513][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:13:27,838][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:13:28,165][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:13:28,490][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:13:28,818][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:13:29,144][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:13:29,471][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:13:29,802][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:13:30,134][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:13:30,462][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:13:30,788][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:13:31,112][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:13:31,439][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:13:31,766][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:13:32,094][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:13:32,426][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:13:32,754][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:13:33,083][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:13:33,412][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:13:34,128][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:13:34,868][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:13:34,870][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:13:34,871][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:13:35,882][__main__][INFO] - Iteration 196 took 22s (37.99% Gen, 57.59% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 54m 7s. Estimated total time: 19h 3m 54s. Time estimates for 10 more iterations: 3m 48s, 100 more iterations: 38m 7s, 500 more iterations: 3h 10m 39s.
[2025-11-13 09:13:35,886][__main__][INFO] - Starting iteration 196.
[2025-11-13 09:13:35,889][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 19 and human policies 1.
[2025-11-13 09:13:35,889][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:13:45,033][__main__][INFO] - Number of regex retries in iteration 196: 0
[2025-11-13 09:13:45,034][__main__][INFO] - agents played in iteration 196 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:13:45,482][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:13:45,519][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:13:45,554][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:13:45,588][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:13:45,589][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:13:45,589][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:13:46,315][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:13:46,613][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:13:46,940][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:13:47,268][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:13:47,595][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:13:47,921][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:13:48,247][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:13:48,572][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:13:48,897][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:13:49,222][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:13:49,548][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:13:49,873][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:13:50,199][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:13:50,525][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:13:50,850][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:13:51,176][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:13:51,502][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:13:51,829][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:13:52,155][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:13:52,480][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:13:52,806][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:13:53,132][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:13:53,458][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:13:53,786][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:13:54,113][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:13:54,441][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:13:54,769][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:13:55,094][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:13:55,419][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:13:55,745][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:13:56,071][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:13:56,403][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:13:56,737][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:13:57,480][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:13:58,236][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:13:58,238][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:13:58,239][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:13:59,196][__main__][INFO] - Iteration 197 took 23s (39.23% Gen, 56.66% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 15m 14s. Estimated total time: 19h 25m 24s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 50s, 500 more iterations: 3h 14m 14s.
[2025-11-13 09:13:59,198][__main__][INFO] - Starting iteration 197.
[2025-11-13 09:13:59,202][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 19 and human policies 1.
[2025-11-13 09:13:59,202][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:14:08,348][__main__][INFO] - Number of regex retries in iteration 197: 0
[2025-11-13 09:14:08,348][__main__][INFO] - agents played in iteration 197 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:14:08,803][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:14:08,839][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:14:08,873][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:14:08,907][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:14:08,908][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:14:08,909][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:14:09,609][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:14:09,904][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:14:10,230][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:14:10,557][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:14:10,882][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:14:11,208][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:14:11,536][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:14:11,865][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:14:12,193][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:14:12,519][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:14:12,845][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:14:13,171][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:14:13,496][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:14:13,822][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:14:14,148][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:14:14,474][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:14:14,800][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:14:15,126][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:14:15,451][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:14:15,779][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:14:16,105][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:14:16,430][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:14:16,756][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:14:17,082][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:14:17,407][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:14:17,734][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:14:18,060][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:14:18,387][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:14:18,713][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:14:19,038][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:14:19,366][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:14:19,694][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:14:20,020][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:14:20,739][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:14:21,495][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:14:21,497][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:14:21,499][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:14:22,660][__main__][INFO] - Iteration 198 took 23s (38.99% Gen, 56.06% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 22m 24s. Estimated total time: 19h 32m 57s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 5s, 500 more iterations: 3h 15m 29s.
[2025-11-13 09:14:22,662][__main__][INFO] - Starting iteration 198.
[2025-11-13 09:14:22,665][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 19 and human policies 1.
[2025-11-13 09:14:22,665][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:14:31,497][__main__][INFO] - Number of regex retries in iteration 198: 0
[2025-11-13 09:14:31,498][__main__][INFO] - agents played in iteration 198 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:14:31,944][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:14:31,980][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:14:32,014][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:14:32,048][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:14:32,049][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:14:32,049][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:14:32,769][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:14:33,066][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:14:33,394][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:14:33,721][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:14:34,046][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:14:34,374][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:14:34,699][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:14:35,025][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:14:35,349][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:14:35,676][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:14:36,003][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:14:36,328][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:14:36,657][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:14:36,982][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:14:37,307][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:14:37,633][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:14:37,958][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:14:38,284][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:14:38,610][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:14:38,936][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:14:39,262][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:14:39,589][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:14:39,914][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:14:40,240][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:14:40,565][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:14:40,892][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:14:41,218][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:14:41,543][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:14:41,871][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:14:42,196][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:14:42,522][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:14:42,849][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:14:43,177][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:14:43,894][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:14:44,648][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:14:44,650][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:14:44,651][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:14:45,632][__main__][INFO] - Iteration 199 took 22s (38.45% Gen, 57.27% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 57m 28s. Estimated total time: 19h 8m 24s. Time estimates for 10 more iterations: 3m 49s, 100 more iterations: 38m 16s, 500 more iterations: 3h 11m 24s.
[2025-11-13 09:14:45,634][__main__][INFO] - Starting iteration 199.
[2025-11-13 09:14:45,636][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 19 and human policies 1.
[2025-11-13 09:14:45,637][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:14:54,759][__main__][INFO] - Number of regex retries in iteration 199: 0
[2025-11-13 09:14:54,759][__main__][INFO] - agents played in iteration 199 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:14:55,201][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:14:55,237][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:14:55,271][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:14:55,305][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:14:55,306][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:14:55,306][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:14:56,016][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:14:56,312][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:14:56,640][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:14:56,965][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:14:57,291][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:14:57,617][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:14:57,943][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:14:58,269][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:14:58,595][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:14:58,922][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:14:59,247][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:14:59,573][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:14:59,899][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:15:00,225][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:15:00,553][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:15:00,880][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:15:01,206][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:15:01,532][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:15:01,859][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:15:02,185][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:15:02,512][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:15:02,839][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:15:03,168][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:15:03,493][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:15:03,820][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:15:04,148][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:15:04,479][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:15:04,807][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:15:05,134][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:15:05,460][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:15:05,787][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:15:06,113][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:15:06,441][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:15:07,158][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:15:07,915][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:15:07,917][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:15:07,919][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:15:08,871][__main__][INFO] - Iteration 200 took 23s (39.26% Gen, 56.63% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 10m 27s. Estimated total time: 19h 21m 47s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 43s, 500 more iterations: 3h 13m 37s.
[2025-11-13 09:15:08,874][__main__][INFO] - Starting iteration 200.
[2025-11-13 09:15:08,876][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 19 and human policies 1.
[2025-11-13 09:15:08,877][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:15:17,747][__main__][INFO] - Number of regex retries in iteration 200: 0
[2025-11-13 09:15:17,747][__main__][INFO] - agents played in iteration 200 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:15:18,185][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:15:18,219][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:15:18,253][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:15:18,287][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:15:18,288][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:15:18,289][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:15:19,026][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:15:19,322][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:15:19,651][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:15:19,977][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:15:20,303][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:15:20,629][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:15:20,956][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:15:21,283][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:15:21,609][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:15:21,936][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:15:22,263][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:15:22,589][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:15:22,916][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:15:23,241][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:15:23,568][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:15:23,893][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:15:24,219][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:15:24,546][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:15:24,872][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:15:25,198][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:15:25,525][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:15:25,850][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:15:26,176][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:15:26,503][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:15:26,829][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:15:27,155][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:15:27,481][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:15:27,807][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:15:28,133][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:15:28,460][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:15:28,786][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:15:29,112][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:15:29,439][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:15:30,168][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:15:30,913][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:15:30,915][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:15:30,916][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:15:33,161][__main__][INFO] - Iteration 201 took 24s (36.52% Gen, 54.23% Train). Generation: 8s, Training: 13s. Estimated remaining time: 19h 2m 32s. Estimated total time: 20h 14m 15s. Time estimates for 10 more iterations: 4m 2s, 100 more iterations: 40m 28s, 500 more iterations: 3h 22m 22s.
[2025-11-13 09:15:33,184][__main__][INFO] - Starting iteration 201.
[2025-11-13 09:15:33,187][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 20 and human policies 1.
[2025-11-13 09:15:33,188][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:15:42,601][__main__][INFO] - Number of regex retries in iteration 201: 0
[2025-11-13 09:15:42,602][__main__][INFO] - agents played in iteration 201 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:15:43,053][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:15:43,087][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:15:43,121][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:15:43,154][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:15:43,155][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:15:43,155][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:15:43,841][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:15:44,136][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:15:44,462][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:15:44,787][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:15:45,112][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:15:45,437][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:15:45,763][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:15:46,091][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:15:46,422][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:15:46,748][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:15:47,074][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:15:47,399][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:15:47,726][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:15:48,053][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:15:48,379][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:15:48,708][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:15:49,033][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:15:49,361][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:15:49,687][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:15:50,012][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:15:50,338][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:15:50,664][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:15:50,989][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:15:51,314][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:15:51,641][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:15:51,968][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:15:52,293][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:15:52,619][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:15:52,946][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:15:53,272][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:15:53,598][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:15:53,925][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:15:54,254][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:15:54,983][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:15:55,725][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:15:55,727][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:15:55,728][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:15:56,654][__main__][INFO] - Iteration 202 took 23s (40.11% Gen, 55.93% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 21m 15s. Estimated total time: 19h 33m 23s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 6s, 500 more iterations: 3h 15m 33s.
[2025-11-13 09:15:56,656][__main__][INFO] - Starting iteration 202.
[2025-11-13 09:15:56,659][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 20 and human policies 1.
[2025-11-13 09:15:56,659][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:16:05,583][__main__][INFO] - Number of regex retries in iteration 202: 0
[2025-11-13 09:16:05,584][__main__][INFO] - agents played in iteration 202 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:16:06,025][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:16:06,058][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:16:06,092][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:16:06,126][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:16:06,126][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:16:06,126][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:16:06,852][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:16:07,147][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:16:07,474][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:16:07,799][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:16:08,125][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:16:08,451][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:16:08,780][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:16:09,106][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:16:09,432][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:16:09,758][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:16:10,085][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:16:10,417][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:16:10,745][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:16:11,075][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:16:11,404][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:16:11,734][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:16:12,062][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:16:12,390][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:16:12,716][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:16:13,044][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:16:13,374][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:16:13,699][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:16:14,029][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:16:14,360][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:16:14,685][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:16:15,014][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:16:15,338][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:16:15,664][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:16:15,991][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:16:16,317][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:16:16,645][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:16:16,972][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:16:17,299][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:16:18,019][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:16:18,770][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:16:18,771][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:16:18,773][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:16:19,728][__main__][INFO] - Iteration 203 took 23s (38.68% Gen, 57.17% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 1m 0s. Estimated total time: 19h 13m 31s. Time estimates for 10 more iterations: 3m 50s, 100 more iterations: 38m 27s, 500 more iterations: 3h 12m 15s.
[2025-11-13 09:16:19,730][__main__][INFO] - Starting iteration 203.
[2025-11-13 09:16:19,733][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 20 and human policies 1.
[2025-11-13 09:16:19,733][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:16:28,965][__main__][INFO] - Number of regex retries in iteration 203: 0
[2025-11-13 09:16:28,966][__main__][INFO] - agents played in iteration 203 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:16:29,406][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:16:29,440][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:16:29,475][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:16:29,510][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:16:29,510][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:16:29,510][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:16:30,244][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:16:30,540][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:16:30,867][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:16:31,195][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:16:31,521][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:16:31,849][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:16:32,181][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:16:32,509][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:16:32,835][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:16:33,162][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:16:33,487][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:16:33,815][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:16:34,141][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:16:34,469][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:16:34,800][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:16:35,129][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:16:35,457][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:16:35,783][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:16:36,110][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:16:36,439][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:16:36,767][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:16:37,093][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:16:37,424][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:16:37,755][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:16:38,083][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:16:38,409][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:16:38,737][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:16:39,062][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:16:39,387][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:16:39,715][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:16:40,041][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:16:40,369][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:16:40,695][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:16:41,426][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:16:42,175][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:16:42,177][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:16:42,178][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:16:43,130][__main__][INFO] - Iteration 204 took 23s (39.46% Gen, 56.47% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 17m 0s. Estimated total time: 19h 29m 54s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 59s, 500 more iterations: 3h 14m 59s.
[2025-11-13 09:16:43,133][__main__][INFO] - Starting iteration 204.
[2025-11-13 09:16:43,135][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 20 and human policies 1.
[2025-11-13 09:16:43,136][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:16:52,347][__main__][INFO] - Number of regex retries in iteration 204: 0
[2025-11-13 09:16:52,348][__main__][INFO] - agents played in iteration 204 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:16:52,798][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:16:52,831][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:16:52,865][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:16:52,898][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:16:52,899][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:16:52,899][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:16:53,634][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:16:53,933][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:16:54,261][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:16:54,588][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:16:54,915][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:16:55,246][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:16:55,574][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:16:55,903][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:16:56,232][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:16:56,563][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:16:56,892][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:16:57,222][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:16:57,551][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:16:57,880][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:16:58,209][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:16:58,537][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:16:58,862][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:16:59,188][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:16:59,513][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:16:59,838][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:17:00,167][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:17:00,494][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:17:00,819][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:17:01,145][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:17:01,470][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:17:01,797][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:17:02,123][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:17:02,448][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:17:02,774][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:17:03,101][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:17:03,426][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:17:03,754][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:17:04,080][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:17:04,808][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:17:05,590][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:17:05,591][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:17:05,593][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:17:06,595][__main__][INFO] - Iteration 205 took 23s (39.27% Gen, 56.46% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 19m 44s. Estimated total time: 19h 33m 1s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 6s, 500 more iterations: 3h 15m 30s.
[2025-11-13 09:17:06,597][__main__][INFO] - Starting iteration 205.
[2025-11-13 09:17:06,600][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 20 and human policies 1.
[2025-11-13 09:17:06,600][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:17:15,957][__main__][INFO] - Number of regex retries in iteration 205: 0
[2025-11-13 09:17:15,957][__main__][INFO] - agents played in iteration 205 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:17:16,406][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:17:16,443][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:17:16,477][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:17:16,511][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:17:16,512][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:17:16,512][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:17:17,259][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:17:17,557][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:17:17,884][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:17:18,212][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:17:18,541][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:17:18,867][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:17:19,194][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:17:19,520][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:17:19,847][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:17:20,173][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:17:20,504][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:17:20,830][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:17:21,156][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:17:21,483][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:17:21,809][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:17:22,135][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:17:22,460][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:17:22,786][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:17:23,112][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:17:23,438][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:17:23,766][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:17:24,093][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:17:24,420][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:17:24,747][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:17:25,073][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:17:25,402][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:17:25,729][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:17:26,056][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:17:26,383][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:17:26,709][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:17:27,035][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:17:27,362][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:17:27,688][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:17:28,427][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:17:29,196][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:17:29,198][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:17:29,199][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:17:30,256][__main__][INFO] - Iteration 206 took 23s (39.55% Gen, 55.98% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 29m 9s. Estimated total time: 19h 42m 50s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 25s, 500 more iterations: 3h 17m 8s.
[2025-11-13 09:17:30,258][__main__][INFO] - Starting iteration 206.
[2025-11-13 09:17:30,261][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 20 and human policies 1.
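The "For task: …, ΔVRAM % (total): …, Current % of VRAM taken: …, Block Peak % of device VRAM: …, ΔTime: …" entries repeated throughout this log follow a fixed per-task tracking format. A hedged sketch of such a tracker is below: a context manager that snapshots memory before and after a block and reports deltas as percentages of device capacity. The `measure()`/`peak()` hooks stand in for something like `torch.cuda.memory_allocated()` and `torch.cuda.max_memory_allocated()`; all names here are assumptions, not the real `mllm.training` helpers.

```python
import time
from contextlib import contextmanager

@contextmanager
def track_block(task, measure, peak, capacity, log=print):
    """Log VRAM delta, current usage, block peak, and wall time for a task.

    measure(): current memory used (bytes or any consistent unit)
    peak():    peak memory used during the block
    capacity:  total device memory, same unit as measure()/peak()
    """
    start_mem, start_t = measure(), time.monotonic()
    yield
    dt = time.monotonic() - start_t
    delta = (measure() - start_mem) / capacity * 100  # net change over block
    cur = measure() / capacity * 100                  # usage right now
    pk = peak() / capacity * 100                      # high-water mark
    hh, rem = divmod(int(dt), 3600)
    mm, ss = divmod(rem, 60)
    log(f"For task: {task}, ΔVRAM % (total): {delta:.2f}%, "
        f"Current % of VRAM taken: {cur:.2f}%, "
        f"Block Peak % of device VRAM: {pk:.2f}%, "
        f"ΔTime: {hh:02d}:{mm:02d}:{ss:02d}")
```

Reporting both the net delta and the block peak is useful here for the same reason it shows up in the log: an operation like "Apply reinforce step" can allocate far more transiently (peak 25.98%) than it retains afterwards (delta 2.51%).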
[2025-11-13 09:17:30,262][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:17:39,909][__main__][INFO] - Number of regex retries in iteration 206: 0
[2025-11-13 09:17:39,910][__main__][INFO] - agents played in iteration 206 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:17:40,355][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:17:40,393][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:17:40,432][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:17:40,471][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:17:40,472][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:17:40,473][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:17:41,188][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:17:41,485][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:17:41,812][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:17:42,137][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:17:42,464][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:17:42,790][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:17:43,118][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:17:43,445][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:17:43,771][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:17:44,098][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:17:44,423][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:17:44,752][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:17:45,080][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:17:45,406][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:17:45,733][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:17:46,060][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:17:46,385][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:17:46,711][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:17:47,036][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:17:47,361][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:17:47,688][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:17:48,014][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:17:48,340][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:17:48,666][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:17:48,991][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:17:49,318][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:17:49,643][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:17:49,971][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:17:50,297][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:17:50,628][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:17:50,954][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:17:51,280][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:17:51,608][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:17:52,327][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:17:53,070][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:17:53,072][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:17:53,074][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:17:54,023][__main__][INFO] - Iteration 207 took 23s (40.60% Gen, 55.40% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 34m 1s. Estimated total time: 19h 48m 6s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 36s, 500 more iterations: 3h 18m 1s.
[2025-11-13 09:17:54,025][__main__][INFO] - Starting iteration 207.
[2025-11-13 09:17:54,027][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 20 and human policies 1.
[2025-11-13 09:17:54,027][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:18:03,410][__main__][INFO] - Number of regex retries in iteration 207: 0
[2025-11-13 09:18:03,410][__main__][INFO] - agents played in iteration 207 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:18:03,855][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:18:03,892][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:18:03,926][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:18:03,960][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:18:03,961][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:18:03,961][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:18:04,661][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:18:04,956][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:18:05,283][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:18:05,608][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:18:05,933][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:18:06,259][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:18:06,585][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:18:06,913][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:18:07,238][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:18:07,564][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:18:07,889][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:18:08,215][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:18:08,541][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:18:08,868][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:18:09,194][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:18:09,520][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:18:09,846][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:18:10,173][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:18:10,500][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:18:10,826][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:18:11,150][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:18:11,477][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:18:11,801][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:18:12,129][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:18:12,456][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:18:12,782][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:18:13,108][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:18:13,435][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:18:13,760][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:18:14,086][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:18:14,411][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:18:14,737][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:18:15,064][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:18:15,786][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:18:16,522][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:18:16,523][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:18:16,525][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:18:17,466][__main__][INFO] - Iteration 208 took 23s (40.03% Gen, 55.95% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 17m 30s. Estimated total time: 19h 31m 58s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 3s, 500 more iterations: 3h 15m 19s.
[2025-11-13 09:18:17,468][__main__][INFO] - Starting iteration 208.
[2025-11-13 09:18:17,470][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 20 and human policies 1.
[2025-11-13 09:18:17,471][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:18:26,743][__main__][INFO] - Number of regex retries in iteration 208: 0
[2025-11-13 09:18:26,744][__main__][INFO] - agents played in iteration 208 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:18:27,187][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:18:27,221][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:18:27,255][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:18:27,290][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:18:27,290][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:18:27,291][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:18:28,014][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:18:28,309][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:18:28,636][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:18:28,961][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:18:29,287][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:18:29,614][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:18:29,938][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:18:30,263][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:18:30,590][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:18:30,916][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:18:31,243][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:18:31,570][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:18:31,897][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:18:32,222][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:18:32,548][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:18:32,873][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:18:33,199][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:18:33,527][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:18:33,851][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:18:34,177][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:18:34,503][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:18:34,829][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:18:35,156][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:18:35,482][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:18:35,807][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:18:36,134][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:18:36,460][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:18:36,787][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:18:37,112][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:18:37,438][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:18:37,763][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:18:38,089][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:18:38,416][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:18:39,137][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:18:39,893][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:18:39,894][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:18:39,896][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:18:40,830][__main__][INFO] - Iteration 209 took 23s (39.69% Gen, 56.30% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 13m 8s. Estimated total time: 19h 28m 0s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 56s, 500 more iterations: 3h 14m 40s.
[2025-11-13 09:18:40,832][__main__][INFO] - Starting iteration 209.
[2025-11-13 09:18:40,834][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 20 and human policies 1.
[2025-11-13 09:18:40,834][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:18:49,220][__main__][INFO] - Number of regex retries in iteration 209: 0
[2025-11-13 09:18:49,221][__main__][INFO] - agents played in iteration 209 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:18:49,666][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:18:49,703][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:18:49,737][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:18:49,771][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:18:49,772][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:18:49,772][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:18:50,523][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:18:50,822][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:18:51,154][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:18:51,486][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:18:51,816][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:18:52,144][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:18:52,471][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:18:52,796][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:18:53,123][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:18:53,449][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:18:53,775][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:18:54,101][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:18:54,428][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:18:54,759][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:18:55,085][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:18:55,412][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:18:55,738][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:18:56,064][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:18:56,392][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:18:56,718][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:18:57,044][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:18:57,371][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:18:57,697][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:18:58,023][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:18:58,349][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:18:58,675][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:18:59,003][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:18:59,328][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:18:59,656][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:18:59,982][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:19:00,308][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:19:00,635][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:19:00,960][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:19:01,666][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:19:02,417][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:19:02,418][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:19:02,420][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:19:03,718][__main__][INFO] - Iteration 210 took 22s (36.64% Gen, 57.68% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 49m 1s. Estimated total time: 19h 4m 15s. Time estimates for 10 more iterations: 3m 48s, 100 more iterations: 38m 8s, 500 more iterations: 3h 10m 42s.
[2025-11-13 09:19:03,721][__main__][INFO] - Starting iteration 210.
[2025-11-13 09:19:03,724][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 20 and human policies 1.
[2025-11-13 09:19:03,725][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:19:12,921][__main__][INFO] - Number of regex retries in iteration 210: 0
[2025-11-13 09:19:12,921][__main__][INFO] - agents played in iteration 210 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:19:13,357][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:19:13,394][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:19:13,428][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:19:13,462][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:19:13,463][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:19:13,463][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:19:14,192][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:19:14,489][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:19:14,817][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:19:15,149][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:19:15,480][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:19:15,807][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:19:16,134][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:19:16,462][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:19:16,787][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:19:17,113][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:19:17,442][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:19:17,769][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:19:18,093][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:19:18,418][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:19:18,743][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:19:19,069][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:19:19,394][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:19:19,724][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:19:20,049][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:19:20,377][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:19:20,701][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:19:21,027][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:19:21,356][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:19:21,684][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:19:22,009][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:19:22,341][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:19:22,669][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:19:22,995][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:19:23,325][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:19:23,653][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:19:23,982][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:19:24,306][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:19:24,634][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:19:25,317][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:19:26,068][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:19:26,069][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:19:26,071][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:19:28,101][__main__][INFO] - Iteration 211 took 24s (37.72% Gen, 53.94% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 3m 15s. Estimated total time: 20h 18m 54s. Time estimates for 10 more iterations: 4m 3s, 100 more iterations: 40m 37s, 500 more iterations: 3h 23m 9s.
[2025-11-13 09:19:28,104][__main__][INFO] - Starting iteration 211.
[2025-11-13 09:19:28,107][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 21 and human policies 1.
[2025-11-13 09:19:28,107][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:19:37,530][__main__][INFO] - Number of regex retries in iteration 211: 0
[2025-11-13 09:19:37,530][__main__][INFO] - agents played in iteration 211 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:19:37,967][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:19:38,003][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:19:38,037][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:19:38,071][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:19:38,072][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:19:38,072][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:19:38,805][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:19:39,102][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:19:39,430][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:19:39,758][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:19:40,084][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:19:40,410][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:19:40,737][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:19:41,062][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:19:41,388][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:19:41,714][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:19:42,040][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:19:42,365][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:19:42,692][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:19:43,018][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:19:43,344][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:19:43,671][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:19:43,998][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:19:44,324][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:19:44,650][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:19:44,976][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:19:45,301][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:19:45,627][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:19:45,954][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:19:46,280][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:19:46,607][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:19:46,933][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:19:47,259][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:19:47,586][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:19:47,912][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:19:48,237][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:19:48,565][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:19:48,892][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:19:49,218][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:19:49,905][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:19:50,647][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:19:50,649][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:19:50,650][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:19:51,626][__main__][INFO] - Iteration 212 took 23s (40.07% Gen, 55.78% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 19m 56s. Estimated total time: 19h 35m 58s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 11s, 500 more iterations: 3h 15m 59s.
[2025-11-13 09:19:51,628][__main__][INFO] - Starting iteration 212.
[2025-11-13 09:19:51,631][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 21 and human policies 1.
[2025-11-13 09:19:51,632][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:20:01,115][__main__][INFO] - Number of regex retries in iteration 212: 0
[2025-11-13 09:20:01,116][__main__][INFO] - agents played in iteration 212 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:20:01,551][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:20:01,585][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:20:01,619][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:20:01,653][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:20:01,654][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:20:01,654][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:20:02,385][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:20:02,682][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:20:03,013][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:20:03,342][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:20:03,669][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:20:03,994][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:20:04,325][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:20:04,651][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:20:04,976][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:20:05,301][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:20:05,633][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:20:05,958][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:20:06,289][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:20:06,617][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:20:06,947][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:20:07,272][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:20:07,598][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:20:07,924][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:20:08,250][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:20:08,575][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:20:08,901][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:20:09,226][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:20:09,553][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:20:09,878][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:20:10,205][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:20:10,529][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:20:10,858][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:20:11,184][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:20:11,509][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:20:11,834][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:20:12,159][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:20:12,485][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:20:12,812][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:20:13,499][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:20:14,243][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:20:14,245][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:20:14,247][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:20:15,218][__main__][INFO] - Iteration 213 took 23s (40.21% Gen, 55.67% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 22m 56s. Estimated total time: 19h 39m 22s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 18s, 500 more iterations: 3h 16m 33s.
[2025-11-13 09:20:15,220][__main__][INFO] - Starting iteration 213.
[2025-11-13 09:20:15,224][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 21 and human policies 1.
[2025-11-13 09:20:15,224][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:20:24,802][__main__][INFO] - Number of regex retries in iteration 213: 0
[2025-11-13 09:20:24,802][__main__][INFO] - agents played in iteration 213 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:20:25,237][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:20:25,272][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:20:25,306][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:20:25,340][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:20:25,340][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:20:25,341][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:20:26,075][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:20:26,372][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:20:26,700][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:20:27,029][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:20:27,358][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:20:27,683][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:20:28,008][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:20:28,336][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:20:28,663][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:20:28,989][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:20:29,316][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:20:29,642][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:20:29,969][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:20:30,301][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:20:30,632][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:20:30,961][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:20:31,292][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:20:31,619][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:20:31,948][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:20:32,279][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:20:32,605][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:20:32,933][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:20:33,261][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:20:33,591][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:20:33,922][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:20:34,252][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:20:34,583][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:20:34,911][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:20:35,240][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:20:35,568][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:20:35,898][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:20:36,230][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:20:36,562][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:20:37,289][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:20:38,042][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:20:38,044][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:20:38,045][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:20:39,146][__main__][INFO] - Iteration 214 took 23s (40.04% Gen, 55.35% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 39m 21s. Estimated total time: 19h 56m 11s. Time estimates for 10 more iterations: 3m 59s, 100 more iterations: 39m 52s, 500 more iterations: 3h 19m 21s.
[2025-11-13 09:20:39,148][__main__][INFO] - Starting iteration 214.
[2025-11-13 09:20:39,152][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 21 and human policies 1.
[2025-11-13 09:20:39,153][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:20:48,645][__main__][INFO] - Number of regex retries in iteration 214: 0
[2025-11-13 09:20:48,645][__main__][INFO] - agents played in iteration 214 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:20:49,084][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:20:49,119][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:20:49,153][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:20:49,188][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:20:49,189][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:20:49,189][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:20:49,927][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:20:50,223][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:20:50,551][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:20:50,883][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:20:51,211][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:20:51,537][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:20:51,865][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:20:52,191][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:20:52,516][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:20:52,842][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:20:53,168][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:20:53,495][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:20:53,822][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:20:54,151][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:20:54,486][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:20:54,813][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:20:55,140][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:20:55,465][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:20:55,792][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:20:56,119][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:20:56,445][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:20:56,770][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:20:57,097][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:20:57,423][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:20:57,749][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:20:58,081][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:20:58,408][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:20:58,735][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:20:59,064][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:20:59,391][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:20:59,717][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:21:00,045][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:21:00,376][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:21:01,097][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:21:02,016][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:21:02,018][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:21:02,033][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:21:03,142][__main__][INFO] - Iteration 215 took 23s (39.57% Gen, 55.80% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 42m 16s. Estimated total time: 19h 59m 30s. Time estimates for 10 more iterations: 3m 59s, 100 more iterations: 39m 59s, 500 more iterations: 3h 19m 55s.
[2025-11-13 09:21:03,143][__main__][INFO] - Starting iteration 215.
[2025-11-13 09:21:03,146][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 21 and human policies 1.
[2025-11-13 09:21:03,147][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:21:12,338][__main__][INFO] - Number of regex retries in iteration 215: 0
[2025-11-13 09:21:12,339][__main__][INFO] - agents played in iteration 215 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:21:12,786][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:21:12,820][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:21:12,853][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:21:12,887][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:21:12,888][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:21:12,888][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:21:13,615][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:21:13,913][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:21:14,240][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:21:14,565][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:21:14,892][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:21:15,219][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:21:15,545][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:21:15,872][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:21:16,198][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:21:16,527][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:21:16,858][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:21:17,185][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:21:17,517][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:21:17,843][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:21:18,168][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:21:18,493][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:21:18,819][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:21:19,144][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:21:19,469][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:21:19,795][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:21:20,122][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:21:20,448][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:21:20,776][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:21:21,101][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:21:21,428][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:21:21,756][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:21:22,083][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:21:22,411][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:21:22,739][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:21:23,066][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:21:23,393][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:21:23,721][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:21:24,047][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:21:24,775][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:21:25,525][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:21:25,527][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:21:25,529][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:21:26,459][__main__][INFO] - Iteration 216 took 23s (39.43% Gen, 56.57% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 8m 5s. Estimated total time: 19h 25m 42s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 51s, 500 more iterations: 3h 14m 17s.
[2025-11-13 09:21:26,461][__main__][INFO] - Starting iteration 216.
[2025-11-13 09:21:26,464][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 21 and human policies 1.
[2025-11-13 09:21:26,465][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:21:35,882][__main__][INFO] - Number of regex retries in iteration 216: 0
[2025-11-13 09:21:35,882][__main__][INFO] - agents played in iteration 216 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:21:36,327][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:21:36,361][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:21:36,396][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:21:36,430][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:21:36,431][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:21:36,431][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:21:37,160][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:21:37,458][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:21:37,784][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:21:38,109][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:21:38,436][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:21:38,761][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:21:39,089][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:21:39,415][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:21:39,741][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:21:40,066][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:21:40,391][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:21:40,716][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:21:41,042][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:21:41,370][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:21:41,694][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:21:42,020][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:21:42,347][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:21:42,672][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:21:42,998][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:21:43,323][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:21:43,648][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:21:43,973][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:21:44,298][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:21:44,624][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:21:44,949][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:21:45,275][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:21:45,601][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:21:45,927][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:21:46,253][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:21:46,578][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:21:46,904][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:21:47,232][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:21:47,560][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:21:48,286][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:21:49,023][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:21:49,025][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:21:49,027][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:21:49,976][__main__][INFO] - Iteration 217 took 23s (40.05% Gen, 55.90% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 17m 37s. Estimated total time: 19h 35m 37s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 11s, 500 more iterations: 3h 15m 56s.
[2025-11-13 09:21:49,978][__main__][INFO] - Starting iteration 217.
[2025-11-13 09:21:49,981][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 21 and human policies 1.
[2025-11-13 09:21:49,981][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:21:59,392][__main__][INFO] - Number of regex retries in iteration 217: 0
[2025-11-13 09:21:59,392][__main__][INFO] - agents played in iteration 217 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:21:59,834][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:21:59,869][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:21:59,902][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:21:59,936][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:21:59,937][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:21:59,937][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:22:00,661][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:22:00,959][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:22:01,289][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:22:01,615][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:22:01,941][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:22:02,267][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:22:02,594][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:22:02,920][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:22:03,249][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:22:03,577][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:22:03,904][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:22:04,231][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:22:04,557][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:22:04,887][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:22:05,218][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:22:05,546][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:22:05,871][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:22:06,202][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:22:06,534][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:22:06,867][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:22:07,197][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:22:07,525][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:22:07,851][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:22:08,177][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:22:08,502][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:22:08,828][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:22:09,156][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:22:09,483][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:22:09,812][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:22:10,137][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:22:10,463][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:22:10,788][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:22:11,120][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:22:11,838][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:22:12,577][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:22:12,578][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:22:12,580][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:22:13,531][__main__][INFO] - Iteration 218 took 23s (39.96% Gen, 56.00% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 19m 8s. Estimated total time: 19h 37m 33s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 15s, 500 more iterations: 3h 16m 15s.
[2025-11-13 09:22:13,533][__main__][INFO] - Starting iteration 218.
[2025-11-13 09:22:13,537][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 21 and human policies 1.
[2025-11-13 09:22:13,538][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:22:23,191][__main__][INFO] - Number of regex retries in iteration 218: 0
[2025-11-13 09:22:23,192][__main__][INFO] - agents played in iteration 218 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:22:23,635][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:22:23,669][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:22:23,704][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:22:23,738][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:22:23,738][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:22:23,739][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:22:24,455][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:22:24,753][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:22:25,080][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:22:25,405][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:22:25,733][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:22:26,057][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:22:26,383][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:22:26,709][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:22:27,036][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:22:27,362][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:22:27,687][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:22:28,013][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:22:28,339][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:22:28,664][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:22:28,990][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:22:29,315][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:22:29,641][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:22:29,966][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:22:30,292][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:22:30,618][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:22:30,949][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:22:31,277][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:22:31,604][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:22:31,929][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:22:32,256][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:22:32,581][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:22:32,914][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:22:33,241][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:22:33,567][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:22:33,895][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:22:34,222][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:22:34,548][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:22:34,877][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:22:35,616][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:22:36,340][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:22:36,342][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:22:36,344][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:22:37,252][__main__][INFO] - Iteration 219 took 23s (40.71% Gen, 55.46% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 27m 1s. Estimated total time: 19h 45m 49s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 31s, 500 more iterations: 3h 17m 38s.
[2025-11-13 09:22:37,254][__main__][INFO] - Starting iteration 219.
[2025-11-13 09:22:37,257][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 21 and human policies 1.
[2025-11-13 09:22:37,258][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:22:45,854][__main__][INFO] - Number of regex retries in iteration 219: 0
[2025-11-13 09:22:45,855][__main__][INFO] - agents played in iteration 219 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:22:46,281][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:22:46,316][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:22:46,349][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:22:46,383][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:22:46,383][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:22:46,383][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:22:47,096][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:22:47,394][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:22:47,720][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:22:48,049][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:22:48,377][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:22:48,703][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:22:49,028][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:22:49,353][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:22:49,680][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:22:50,005][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:22:50,330][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:22:50,656][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:22:50,982][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:22:51,308][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:22:51,634][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:22:51,961][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:22:52,287][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:22:52,612][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:22:52,937][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:22:53,264][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:22:53,590][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:22:53,917][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:22:54,245][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:22:54,572][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:22:54,897][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:22:55,223][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:22:55,549][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:22:55,874][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:22:56,199][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:22:56,525][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:22:56,852][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:22:57,178][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:22:57,504][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:22:58,204][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:22:58,926][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:22:58,928][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:22:58,930][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:22:59,850][__main__][INFO] - Iteration 220 took 22s (38.05% Gen, 57.87% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 30m 31s. Estimated total time: 18h 49m 42s. Time estimates for 10 more iterations: 3m 45s, 100 more iterations: 37m 39s, 500 more iterations: 3h 8m 17s. [2025-11-13 09:22:59,853][__main__][INFO] - Starting iteration 220. [2025-11-13 09:22:59,855][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 21 and human policies 1. 
[2025-11-13 09:22:59,856][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:23:08,684][__main__][INFO] - Number of regex retries in iteration 220: 0
[2025-11-13 09:23:08,685][__main__][INFO] - agents played in iteration 220 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:23:09,128][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:23:09,163][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:23:09,197][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:23:09,230][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:23:09,231][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:23:09,231][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:23:09,948][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:23:10,244][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:23:10,572][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:23:10,902][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:23:11,232][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:23:11,564][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:23:11,890][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:23:12,217][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:23:12,546][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:23:12,874][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:23:13,198][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:23:13,524][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:23:13,852][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:23:14,180][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:23:14,506][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:23:14,832][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:23:15,158][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:23:15,484][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:23:15,812][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:23:16,139][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:23:16,467][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:23:16,794][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:23:17,122][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:23:17,448][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:23:17,775][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:23:18,103][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:23:18,431][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:23:18,763][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:23:19,090][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:23:19,416][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:23:19,742][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:23:20,068][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:23:20,394][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:23:21,097][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:23:21,832][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:23:21,833][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:23:21,835][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:23:23,744][__main__][INFO] - Iteration 221 took 23s (36.96% Gen, 55.04% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 34m 55s. Estimated total time: 19h 54m 30s. Time estimates for 10 more iterations: 3m 58s, 100 more iterations: 39m 49s, 500 more iterations: 3h 19m 5s.
[2025-11-13 09:23:23,746][__main__][INFO] - Starting iteration 221.
[2025-11-13 09:23:23,749][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 22 and human policies 1.
[2025-11-13 09:23:23,749][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:23:32,828][__main__][INFO] - Number of regex retries in iteration 221: 0
[2025-11-13 09:23:32,829][__main__][INFO] - agents played in iteration 221 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:23:33,263][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:23:33,301][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:23:33,335][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:23:33,689][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:23:33,690][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:23:33,690][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:23:34,411][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:23:34,709][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:23:35,038][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:23:35,365][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:23:35,692][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:23:36,017][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:23:36,344][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:23:36,669][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:23:36,998][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:23:37,326][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:23:37,651][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:23:37,983][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:23:38,311][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:23:38,640][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:23:38,966][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:23:39,292][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:23:39,618][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:23:39,945][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:23:40,271][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:23:40,598][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:23:40,923][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:23:41,250][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:23:41,578][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:23:41,905][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:23:42,231][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:23:42,558][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:23:42,885][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:23:43,211][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:23:43,540][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:23:43,867][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:23:44,196][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:23:44,525][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:23:44,852][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:23:45,557][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:23:46,283][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:23:46,285][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:23:46,287][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:23:47,227][__main__][INFO] - Iteration 222 took 23s (38.67% Gen, 57.32% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 13m 59s. Estimated total time: 19h 33m 57s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 7s, 500 more iterations: 3h 15m 39s.
[2025-11-13 09:23:47,229][__main__][INFO] - Starting iteration 222.
[2025-11-13 09:23:47,232][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 22 and human policies 1.
[2025-11-13 09:23:47,233][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:23:56,858][__main__][INFO] - Number of regex retries in iteration 222: 0
[2025-11-13 09:23:56,859][__main__][INFO] - agents played in iteration 222 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:23:57,289][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:23:57,323][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:23:57,357][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:23:57,391][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:23:57,392][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:23:57,392][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:23:58,117][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:23:58,414][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:23:58,744][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:23:59,076][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:23:59,406][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:23:59,734][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:24:00,060][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:24:00,386][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:24:00,713][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:24:01,038][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:24:01,363][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:24:01,689][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:24:02,014][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:24:02,340][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:24:02,665][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:24:02,992][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:24:03,318][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:24:03,643][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:24:03,970][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:24:04,296][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:24:04,621][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:24:04,946][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:24:05,272][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:24:05,599][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:24:05,925][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:24:06,253][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:24:06,579][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:24:06,906][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:24:07,231][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:24:07,557][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:24:07,885][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:24:08,209][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:24:08,536][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:24:09,240][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:24:09,971][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:24:09,973][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:24:09,975][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:24:10,892][__main__][INFO] - Iteration 223 took 23s (40.68% Gen, 55.44% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 22m 41s. Estimated total time: 19h 43m 3s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 26s, 500 more iterations: 3h 17m 10s.
[2025-11-13 09:24:10,894][__main__][INFO] - Starting iteration 223.
[2025-11-13 09:24:10,897][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 22 and human policies 1.
[2025-11-13 09:24:10,897][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:24:20,681][__main__][INFO] - Number of regex retries in iteration 223: 0
[2025-11-13 09:24:20,682][__main__][INFO] - agents played in iteration 223 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:24:21,120][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:24:21,154][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:24:21,188][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:24:21,223][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:24:21,223][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:24:21,224][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:24:21,948][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:24:22,244][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:24:22,572][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:24:22,897][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:24:23,224][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:24:23,550][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:24:23,876][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:24:24,202][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:24:24,528][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:24:24,853][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:24:25,178][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:24:25,506][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:24:25,831][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:24:26,157][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:24:26,483][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:24:26,809][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:24:27,137][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:24:27,462][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:24:27,787][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:24:28,112][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:24:28,439][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:24:28,765][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:24:29,090][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:24:29,416][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:24:29,743][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:24:30,069][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:24:30,395][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:24:30,721][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:24:31,047][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:24:31,373][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:24:31,700][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:24:32,026][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:24:32,352][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:24:33,062][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:24:33,776][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:24:33,777][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:24:33,779][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:24:34,870][__main__][INFO] - Iteration 224 took 23s (40.81% Gen, 54.63% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 37m 55s. Estimated total time: 19h 58m 41s. Time estimates for 10 more iterations: 3m 59s, 100 more iterations: 39m 57s, 500 more iterations: 3h 19m 46s.
[2025-11-13 09:24:34,872][__main__][INFO] - Starting iteration 224.
[2025-11-13 09:24:34,875][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 22 and human policies 1.
[2025-11-13 09:24:34,875][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:24:44,399][__main__][INFO] - Number of regex retries in iteration 224: 0
[2025-11-13 09:24:44,400][__main__][INFO] - agents played in iteration 224 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:24:44,842][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:24:44,876][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:24:44,909][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:24:44,943][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:24:44,944][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:24:44,945][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:24:45,656][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:24:45,954][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:24:46,280][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:24:46,609][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:24:46,934][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:24:47,260][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:24:47,587][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:24:47,912][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:24:48,239][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:24:48,565][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:24:48,890][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:24:49,218][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:24:49,547][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:24:49,878][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:24:50,209][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:24:50,536][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:24:50,863][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:24:51,189][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:24:51,516][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:24:51,851][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:24:52,184][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:24:52,516][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:24:52,842][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:24:53,171][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:24:53,500][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:24:53,827][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:24:54,154][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:24:54,481][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:24:54,814][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:24:55,143][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:24:55,474][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:24:55,806][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:24:56,136][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:24:56,878][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:24:57,588][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:24:57,589][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:24:57,591][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:24:58,515][__main__][INFO] - Iteration 225 took 23s (40.29% Gen, 55.80% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 20m 53s. Estimated total time: 19h 42m 3s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 24s, 500 more iterations: 3h 17m 0s.
[2025-11-13 09:24:58,517][__main__][INFO] - Starting iteration 225.
[2025-11-13 09:24:58,520][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 22 and human policies 1.
[2025-11-13 09:24:58,521][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:25:07,903][__main__][INFO] - Number of regex retries in iteration 225: 0 [2025-11-13 09:25:07,903][__main__][INFO] - agents played in iteration 225 are Alice_buffer, Alice, Bob, Bob_buffer [2025-11-13 09:25:08,338][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:25:08,372][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:25:08,407][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:25:08,442][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:25:08,442][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:25:08,443][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:25:09,157][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:25:09,453][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:25:09,782][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:25:10,108][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:25:10,434][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:25:10,760][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:25:11,085][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:25:11,411][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:25:11,737][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:25:12,064][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:25:12,391][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:25:12,716][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:25:13,044][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:25:13,369][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:25:13,696][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:25:14,021][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:25:14,347][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:25:14,675][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:25:15,002][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:25:15,327][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:25:15,655][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:25:15,982][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:25:16,309][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:25:16,633][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:25:16,961][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:25:17,287][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:25:17,613][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:25:17,940][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:25:18,266][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:25:18,592][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:25:18,919][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:25:19,245][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:25:19,572][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:25:20,292][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:25:21,002][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:25:21,004][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:25:21,005][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:25:21,927][__main__][INFO] - Iteration 226 took 23s (40.08% Gen, 55.98% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 8m 49s. Estimated total time: 19h 30m 22s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 0s, 500 more iterations: 3h 15m 3s.
[2025-11-13 09:25:21,929][__main__][INFO] - Starting iteration 226.
[2025-11-13 09:25:21,932][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 22 and human policies 1.
[2025-11-13 09:25:21,932][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:25:31,451][__main__][INFO] - Number of regex retries in iteration 226: 0
[2025-11-13 09:25:31,452][__main__][INFO] - agents played in iteration 226 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:25:31,901][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:25:31,935][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:25:31,969][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:25:32,003][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:25:32,004][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:25:32,004][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:25:32,726][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:25:33,022][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:25:33,350][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:25:33,676][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:25:34,003][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:25:34,329][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:25:34,654][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:25:34,981][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:25:35,307][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:25:35,632][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:25:35,957][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:25:36,283][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:25:36,609][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:25:36,934][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:25:37,262][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:25:37,590][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:25:37,915][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:25:38,242][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:25:38,568][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:25:38,895][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:25:39,223][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:25:39,552][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:25:39,878][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:25:40,206][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:25:40,531][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:25:40,857][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:25:41,183][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:25:41,514][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:25:41,841][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:25:42,173][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:25:42,504][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:25:42,831][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:25:43,160][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:25:43,888][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:25:44,593][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:25:44,595][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:25:44,597][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:25:45,663][__main__][INFO] - Iteration 227 took 23s (40.11% Gen, 55.39% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 24m 39s. Estimated total time: 19h 46m 36s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 33s, 500 more iterations: 3h 17m 46s.
[2025-11-13 09:25:45,665][__main__][INFO] - Starting iteration 227.
[2025-11-13 09:25:45,667][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 22 and human policies 1.
[2025-11-13 09:25:45,668][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:25:55,214][__main__][INFO] - Number of regex retries in iteration 227: 0
[2025-11-13 09:25:55,214][__main__][INFO] - agents played in iteration 227 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:25:55,654][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:25:55,687][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:25:55,723][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:25:55,757][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:25:55,758][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:25:55,758][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:25:56,466][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:25:56,763][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:25:57,089][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:25:57,414][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:25:57,740][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:25:58,065][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:25:58,392][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:25:58,718][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:25:59,046][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:25:59,374][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:25:59,699][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:26:00,026][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:26:00,352][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:26:00,683][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:26:01,010][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:26:01,336][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:26:01,665][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:26:01,994][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:26:02,319][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:26:02,645][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:26:02,977][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:26:03,303][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:26:03,629][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:26:03,956][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:26:04,282][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:26:04,608][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:26:04,933][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:26:05,259][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:26:05,588][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:26:05,915][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:26:06,241][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:26:06,568][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:26:06,897][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:26:07,640][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:26:08,345][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:26:08,347][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:26:08,348][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:26:09,248][__main__][INFO] - Iteration 228 took 23s (40.48% Gen, 55.70% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 16m 43s. Estimated total time: 19h 39m 3s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 18s, 500 more iterations: 3h 16m 30s.
[2025-11-13 09:26:09,250][__main__][INFO] - Starting iteration 228.
[2025-11-13 09:26:09,252][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 22 and human policies 1.
[2025-11-13 09:26:09,253][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:26:18,030][__main__][INFO] - Number of regex retries in iteration 228: 0
[2025-11-13 09:26:18,030][__main__][INFO] - agents played in iteration 228 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:26:18,473][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:26:18,506][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:26:18,539][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:26:18,572][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:26:18,573][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:26:18,574][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:26:19,310][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:26:19,606][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:26:19,934][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:26:20,260][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:26:20,588][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:26:20,916][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:26:21,243][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:26:21,572][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:26:21,899][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:26:22,230][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:26:22,556][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:26:22,882][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:26:23,209][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:26:23,535][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:26:23,861][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:26:24,186][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:26:24,512][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:26:24,837][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:26:25,162][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:26:25,488][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:26:25,813][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:26:26,139][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:26:26,465][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:26:26,792][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:26:27,118][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:26:27,448][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:26:27,776][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:26:28,107][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:26:28,435][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:26:28,761][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:26:29,089][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:26:29,417][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:26:29,744][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:26:30,469][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:26:31,173][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:26:31,175][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:26:31,177][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:26:32,368][__main__][INFO] - Iteration 229 took 23s (37.97% Gen, 56.87% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 53m 5s. Estimated total time: 19h 15m 49s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 31s, 500 more iterations: 3h 12m 38s.
[2025-11-13 09:26:32,370][__main__][INFO] - Starting iteration 229.
[2025-11-13 09:26:32,373][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 22 and human policies 1.
[2025-11-13 09:26:32,373][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:26:41,772][__main__][INFO] - Number of regex retries in iteration 229: 0
[2025-11-13 09:26:41,773][__main__][INFO] - agents played in iteration 229 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:26:42,215][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:26:42,248][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:26:42,281][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:26:42,315][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:26:42,315][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:26:42,316][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:26:43,035][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:26:43,330][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:26:43,660][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:26:43,991][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:26:44,319][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:26:44,646][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:26:44,974][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:26:45,300][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:26:45,625][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:26:45,953][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:26:46,281][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:26:46,610][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:26:46,939][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:26:47,266][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:26:47,592][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:26:47,919][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:26:48,245][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:26:48,570][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:26:48,897][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:26:49,224][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:26:49,550][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:26:49,876][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:26:50,202][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:26:50,529][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:26:50,854][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:26:51,181][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:26:51,508][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:26:51,834][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:26:52,161][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:26:52,489][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:26:52,816][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:26:53,144][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:26:53,471][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:26:54,196][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:26:54,936][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:26:54,937][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:26:54,939][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:26:55,952][__main__][INFO] - Iteration 230 took 23s (39.86% Gen, 55.84% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 15m 54s. Estimated total time: 19h 39m 1s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 18s, 500 more iterations: 3h 16m 30s.
[2025-11-13 09:26:55,955][__main__][INFO] - Starting iteration 230.
[2025-11-13 09:26:55,958][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 22 and human policies 1.
[2025-11-13 09:26:55,958][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:27:05,304][__main__][INFO] - Number of regex retries in iteration 230: 0
[2025-11-13 09:27:05,305][__main__][INFO] - agents played in iteration 230 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:27:05,760][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:27:05,795][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:27:05,829][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:27:05,863][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:27:05,863][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:27:05,864][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:27:06,602][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:27:06,899][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:27:07,226][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:27:07,552][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:27:07,879][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:27:08,209][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:27:08,540][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:27:08,864][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:27:09,191][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:27:09,519][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:27:09,850][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:27:10,176][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:27:10,502][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:27:10,830][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:27:11,155][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:27:11,482][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:27:11,811][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:27:12,139][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:27:12,469][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:27:12,796][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:27:13,125][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:27:13,457][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:27:13,786][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:27:14,115][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:27:14,441][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:27:14,769][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:27:15,095][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:27:15,422][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:27:15,751][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:27:16,079][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:27:16,405][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:27:16,731][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:27:17,058][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:27:17,783][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:27:18,505][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:27:18,506][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:27:18,508][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:27:20,321][__main__][INFO] - Iteration 231 took 24s (38.36% Gen, 54.19% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 54m 39s. Estimated total time: 20h 18m 10s. Time estimates for 10 more iterations: 4m 3s, 100 more iterations: 40m 36s, 500 more iterations: 3h 23m 1s. [2025-11-13 09:27:20,322][__main__][INFO] - Starting iteration 231. [2025-11-13 09:27:20,327][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 23 and human policies 1. 
[2025-11-13 09:27:20,327][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:27:30,134][__main__][INFO] - Number of regex retries in iteration 231: 0
[2025-11-13 09:27:30,134][__main__][INFO] - agents played in iteration 231 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:27:30,576][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:27:30,610][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:27:30,643][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:27:30,677][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:27:30,677][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:27:30,678][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:27:31,395][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:27:31,691][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:27:32,018][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:27:32,345][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:27:32,675][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:27:33,002][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:27:33,326][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:27:33,652][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:27:33,978][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:27:34,305][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:27:34,632][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:27:34,958][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:27:35,284][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:27:35,612][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:27:35,940][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:27:36,267][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:27:36,598][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:27:36,924][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:27:37,251][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:27:37,579][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:27:37,906][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:27:38,232][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:27:38,559][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:27:38,886][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:27:39,213][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:27:39,540][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:27:39,867][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:27:40,193][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:27:40,522][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:27:40,850][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:27:41,182][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:27:41,512][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:27:41,838][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:27:42,566][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:27:43,282][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:27:43,283][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:27:43,285][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:27:44,371][__main__][INFO] - Iteration 232 took 24s (40.79% Gen, 54.69% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 38m 20s. Estimated total time: 20h 2m 16s. Time estimates for 10 more iterations: 4m 0s, 100 more iterations: 40m 4s, 500 more iterations: 3h 20m 22s.
[2025-11-13 09:27:44,373][__main__][INFO] - Starting iteration 232.
[2025-11-13 09:27:44,375][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 23 and human policies 1.
[2025-11-13 09:27:44,376][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:27:53,340][__main__][INFO] - Number of regex retries in iteration 232: 0
[2025-11-13 09:27:53,341][__main__][INFO] - agents played in iteration 232 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:27:53,780][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:27:53,815][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:27:53,849][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:27:53,883][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:27:53,883][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:27:53,884][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:27:54,588][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:27:54,882][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:27:55,208][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:27:55,535][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:27:55,861][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:27:56,187][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:27:56,513][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:27:56,840][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:27:57,166][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:27:57,498][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:27:57,826][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:27:58,152][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:27:58,478][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:27:58,804][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:27:59,131][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:27:59,457][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:27:59,783][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:28:00,110][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:28:00,436][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:28:00,763][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:28:01,090][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:28:01,416][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:28:01,743][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:28:02,070][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:28:02,396][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:28:02,725][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:28:03,052][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:28:03,380][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:28:03,711][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:28:04,039][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:28:04,369][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:28:04,695][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:28:05,026][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:28:05,751][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:28:06,454][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:28:06,455][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:28:06,457][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:28:07,348][__main__][INFO] - Iteration 233 took 22s (39.02% Gen, 57.09% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 44m 22s. Estimated total time: 19h 8m 40s. Time estimates for 10 more iterations: 3m 49s, 100 more iterations: 38m 17s, 500 more iterations: 3h 11m 26s.
[2025-11-13 09:28:07,350][__main__][INFO] - Starting iteration 233.
[2025-11-13 09:28:07,352][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 23 and human policies 1.
[2025-11-13 09:28:07,353][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:28:16,574][__main__][INFO] - Number of regex retries in iteration 233: 0
[2025-11-13 09:28:16,575][__main__][INFO] - agents played in iteration 233 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:28:17,011][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:28:17,044][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:28:17,077][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:28:17,111][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:28:17,112][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:28:17,112][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:28:17,819][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:28:18,116][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:28:18,443][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:28:18,768][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:28:19,095][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:28:19,422][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:28:19,748][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:28:20,074][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:28:20,401][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:28:20,728][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:28:21,056][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:28:21,385][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:28:21,713][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:28:22,039][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:28:22,366][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:28:22,691][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:28:23,020][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:28:23,346][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:28:23,673][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:28:23,999][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:28:24,326][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:28:24,651][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:28:24,979][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:28:25,304][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:28:25,630][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:28:25,956][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:28:26,284][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:28:26,612][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:28:26,938][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:28:27,263][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:28:27,596][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:28:27,924][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:28:28,253][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:28:28,959][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:28:29,687][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:28:29,688][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:28:29,690][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:28:30,677][__main__][INFO] - Iteration 234 took 23s (39.53% Gen, 56.23% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 1m 35s. Estimated total time: 19h 26m 16s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 52s, 500 more iterations: 3h 14m 22s.
[2025-11-13 09:28:30,679][__main__][INFO] - Starting iteration 234.
[2025-11-13 09:28:30,683][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 23 and human policies 1.
[2025-11-13 09:28:30,683][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:28:39,583][__main__][INFO] - Number of regex retries in iteration 234: 0
[2025-11-13 09:28:39,584][__main__][INFO] - agents played in iteration 234 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:28:40,026][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:28:40,062][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:28:40,095][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:28:40,129][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:28:40,130][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:28:40,130][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:28:40,862][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:28:41,159][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:28:41,486][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:28:41,814][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:28:42,142][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:28:42,472][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:28:42,805][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:28:43,132][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:28:43,462][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:28:43,792][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:28:44,120][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:28:44,447][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:28:44,775][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:28:45,101][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:28:45,429][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:28:45,755][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:28:46,081][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:28:46,407][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:28:46,735][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:28:47,061][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:28:47,388][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:28:47,714][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:28:48,040][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:28:48,367][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:28:48,693][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:28:49,018][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:28:49,345][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:28:49,671][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:28:49,996][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:28:50,324][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:28:50,650][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:28:50,978][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:28:51,304][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:28:52,025][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:28:52,746][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:28:52,747][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:28:52,749][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:28:53,641][__main__][INFO] - Iteration 235 took 22s (38.77% Gen, 57.34% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 42m 54s. Estimated total time: 19h 7m 59s. Time estimates for 10 more iterations: 3m 49s, 100 more iterations: 38m 15s, 500 more iterations: 3h 11m 19s.
[2025-11-13 09:28:53,643][__main__][INFO] - Starting iteration 235.
[2025-11-13 09:28:53,646][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 23 and human policies 1.
[2025-11-13 09:28:53,646][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:29:02,917][__main__][INFO] - Number of regex retries in iteration 235: 0
[2025-11-13 09:29:02,918][__main__][INFO] - agents played in iteration 235 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:29:03,361][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:29:03,394][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:29:03,428][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:29:03,462][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:29:03,463][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:29:03,463][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:29:04,182][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:29:04,477][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:29:04,804][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:29:05,132][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:29:05,461][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:29:05,786][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:29:06,113][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:29:06,441][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:29:06,770][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:29:07,097][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:29:07,426][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:29:07,753][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:29:08,079][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:29:08,406][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:29:08,732][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:29:09,059][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:29:09,385][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:29:09,711][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:29:10,039][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:29:10,366][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:29:10,691][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:29:11,019][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:29:11,346][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:29:11,675][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:29:12,002][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:29:12,330][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:29:12,657][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:29:12,983][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:29:13,310][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:29:13,638][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:29:13,966][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:29:14,292][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:29:14,622][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
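The run of "Processing mini-batch k of 128" entries ending in a single "Accumulated the policy gradient loss" line, followed by one "Apply reinforce step", is the classic gradient-accumulation pattern: per-mini-batch backward passes summed into one update. A minimal, generic sketch of that pattern (hypothetical helper names, plain Python instead of the repository's trainer code):

```python
def accumulate_and_step(params, grads_per_minibatch, lr=0.1, log_every=4):
    """Average gradients over all mini-batches, then apply a single update step.

    params: list of scalar parameters.
    grads_per_minibatch: one gradient list per mini-batch (precomputed here for
    simplicity; a real trainer would run forward/backward per mini-batch).
    """
    n = len(grads_per_minibatch)
    acc = [0.0] * len(params)
    for k, grads in enumerate(grads_per_minibatch):
        if k % log_every == 0:  # progress log, as in "Processing mini-batch k of n"
            print(f"Processing mini-batch {k} of {n}")
        # Scale each mini-batch gradient by 1/n so the accumulated sum is a mean.
        acc = [a + g / n for a, g in zip(acc, grads)]
    # One optimizer step for the whole batch, as in "Apply reinforce step".
    return [p - lr * a for p, a in zip(params, acc)]
```

The `log_every=4` default mirrors the logs above, which report progress only every fourth mini-batch to keep output manageable.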
[2025-11-13 09:29:15,345][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:29:16,086][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:29:16,087][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:29:16,089][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:29:17,272][__main__][INFO] - Iteration 236 took 23s (39.24% Gen, 55.74% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 15m 52s. Estimated total time: 19h 41m 20s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 22s, 500 more iterations: 3h 16m 53s.
[2025-11-13 09:29:17,274][__main__][INFO] - Starting iteration 236.
[2025-11-13 09:29:17,277][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 23 and human policies 1.
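The per-iteration summaries ("Iteration 232 took 24s ... 10 more iterations: 4m 0s, 100 more iterations: 40m 4s") are linear extrapolations from the measured iteration time. A simplified sketch of that arithmetic (hypothetical helper, not the project's actual timing code; it assumes a constant per-iteration rate, whereas the real estimate also folds in elapsed time, which is why "total" exceeds "remaining" in the logs):

```python
def eta_estimates(seconds_per_iter: float, iters_done: int, iters_total: int) -> dict:
    """Extrapolate wall-clock estimates from a single iteration's duration."""
    return {
        # Time left at the current rate for all remaining iterations.
        "remaining_s": seconds_per_iter * (iters_total - iters_done),
        # Horizon estimates, as in "10 more iterations" / "100 more iterations".
        "next_10_s": seconds_per_iter * 10,
        "next_100_s": seconds_per_iter * 100,
        "next_500_s": seconds_per_iter * 500,
    }
```

For a 24 s iteration, 10 more iterations extrapolate to 240 s (4 m 0 s), matching the iteration 232 summary above up to sub-second rounding.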
[2025-11-13 09:29:17,277][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:29:25,996][__main__][INFO] - Number of regex retries in iteration 236: 0 [2025-11-13 09:29:25,996][__main__][INFO] - agents played in iteration 236 are Alice_buffer, Alice, Bob, Bob_buffer [2025-11-13 09:29:26,438][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:29:26,473][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:29:26,507][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:29:26,542][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:29:26,542][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:29:26,543][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:29:27,266][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:29:27,562][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:29:27,891][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:29:28,217][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:29:28,545][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:29:28,872][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:29:29,200][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:29:29,527][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:29:29,854][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:29:30,182][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:29:30,509][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:29:30,836][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:29:31,163][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:29:31,489][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:29:31,817][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:29:32,144][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:29:32,472][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:29:32,799][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:29:33,126][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:29:33,454][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:29:33,783][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:29:34,110][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:29:34,437][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:29:34,763][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:29:35,091][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:29:35,418][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:29:35,745][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:29:36,077][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:29:36,405][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:29:36,736][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:29:37,064][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:29:37,390][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:29:37,717][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:29:38,425][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:29:39,142][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:29:39,144][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:29:39,146][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:29:40,137][__main__][INFO] - Iteration 237 took 22s (38.14% Gen, 57.52% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 37m 12s. Estimated total time: 19h 3m 3s. Time estimates for 10 more iterations: 3m 48s, 100 more iterations: 38m 6s, 500 more iterations: 3h 10m 30s.
[2025-11-13 09:29:40,139][__main__][INFO] - Starting iteration 237.
[2025-11-13 09:29:40,141][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 23 and human policies 1.
[2025-11-13 09:29:40,142][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:29:49,289][__main__][INFO] - Number of regex retries in iteration 237: 0
[2025-11-13 09:29:49,290][__main__][INFO] - agents played in iteration 237 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:29:49,728][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:29:49,762][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:29:49,796][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:29:49,829][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:29:49,830][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:29:49,830][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:29:50,557][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:29:50,853][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:29:51,181][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:29:51,507][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:29:51,834][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:29:52,161][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:29:52,491][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:29:52,817][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:29:53,143][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:29:53,470][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:29:53,795][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:29:54,121][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:29:54,448][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:29:54,774][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:29:55,102][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:29:55,432][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:29:55,760][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:29:56,087][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:29:56,414][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:29:56,741][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:29:57,067][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:29:57,393][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:29:57,720][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:29:58,049][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:29:58,376][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:29:58,705][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:29:59,031][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:29:59,357][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:29:59,684][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:30:00,011][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:30:00,339][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:30:00,666][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:30:00,992][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:30:01,706][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:30:02,419][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:30:02,420][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:30:02,422][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:30:03,324][__main__][INFO] - Iteration 238 took 23s (39.46% Gen, 56.65% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 52m 56s. Estimated total time: 19h 19m 11s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 38s, 500 more iterations: 3h 13m 11s.
[2025-11-13 09:30:03,326][__main__][INFO] - Starting iteration 238.
[2025-11-13 09:30:03,329][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 23 and human policies 1.
[2025-11-13 09:30:03,330][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:30:12,820][__main__][INFO] - Number of regex retries in iteration 238: 0
[2025-11-13 09:30:12,821][__main__][INFO] - agents played in iteration 238 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:30:13,261][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:30:13,295][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:30:13,329][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:30:13,363][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:30:13,363][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:30:13,364][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:30:14,084][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:30:14,380][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:30:14,708][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:30:15,037][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:30:15,364][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:30:15,691][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:30:16,018][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:30:16,345][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:30:16,672][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:30:16,997][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:30:17,324][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:30:17,652][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:30:17,979][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:30:18,306][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:30:18,635][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:30:18,963][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:30:19,292][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:30:19,622][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:30:19,955][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:30:20,282][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:30:20,609][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:30:20,936][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:30:21,265][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:30:21,592][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:30:21,920][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:30:22,247][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:30:22,579][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:30:22,909][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:30:23,238][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:30:23,565][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:30:23,891][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:30:24,218][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:30:24,546][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:30:25,268][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:30:25,965][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:30:25,966][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:30:25,968][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:30:27,054][__main__][INFO] - Iteration 239 took 23s (40.00% Gen, 55.41% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 19m 39s. Estimated total time: 19h 46m 17s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 32s, 500 more iterations: 3h 17m 42s.
[2025-11-13 09:30:27,056][__main__][INFO] - Starting iteration 239.
[2025-11-13 09:30:27,059][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 23 and human policies 1.
[2025-11-13 09:30:27,059][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:30:35,797][__main__][INFO] - Number of regex retries in iteration 239: 0
[2025-11-13 09:30:35,798][__main__][INFO] - agents played in iteration 239 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:30:36,237][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:30:36,270][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:30:36,304][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:30:36,338][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:30:36,339][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:30:36,339][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:30:37,066][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:30:37,363][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:30:37,691][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:30:38,016][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:30:38,342][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:30:38,674][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:30:39,002][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:30:39,331][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:30:39,658][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:30:39,987][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:30:40,315][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:30:40,643][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:30:40,969][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:30:41,296][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:30:41,624][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:30:41,952][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:30:42,280][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:30:42,610][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:30:42,940][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:30:43,267][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:30:43,599][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:30:43,925][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:30:44,252][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:30:44,578][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:30:44,910][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:30:45,236][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:30:45,562][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:30:45,894][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:30:46,221][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:30:46,549][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:30:46,880][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:30:47,206][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:30:47,533][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:30:48,245][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:30:48,964][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:30:48,966][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:30:48,968][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:30:49,915][__main__][INFO] - Iteration 240 took 22s (38.23% Gen, 57.62% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 35m 51s. Estimated total time: 19h 2m 51s. Time estimates for 10 more iterations: 3m 48s, 100 more iterations: 38m 5s, 500 more iterations: 3h 10m 28s.
[2025-11-13 09:30:49,917][__main__][INFO] - Starting iteration 240.
[2025-11-13 09:30:49,920][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 23 and human policies 1.
[2025-11-13 09:30:49,920][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:30:58,979][__main__][INFO] - Number of regex retries in iteration 240: 0
[2025-11-13 09:30:58,979][__main__][INFO] - agents played in iteration 240 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:30:59,429][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:30:59,462][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:30:59,495][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:30:59,529][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:30:59,530][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:30:59,530][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:31:00,261][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:31:00,558][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:31:00,885][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:31:01,211][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:31:01,541][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:31:01,867][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:31:02,193][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:31:02,519][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:31:02,845][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:31:03,172][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:31:03,499][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:31:03,825][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:31:04,152][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:31:04,480][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:31:04,805][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:31:05,132][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:31:05,459][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:31:05,786][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:31:06,118][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:31:06,446][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:31:06,772][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:31:07,101][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:31:07,426][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:31:07,752][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:31:08,079][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:31:08,407][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:31:08,734][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:31:09,059][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:31:09,386][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:31:09,716][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:31:10,041][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:31:10,368][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:31:10,695][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:31:11,416][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:31:12,150][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:31:12,152][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:31:12,153][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:31:14,092][__main__][INFO] - Iteration 241 took 24s (37.48% Gen, 54.50% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 41m 15s. Estimated total time: 20h 8m 40s. Time estimates for 10 more iterations: 4m 1s, 100 more iterations: 40m 17s, 500 more iterations: 3h 21m 26s.
[2025-11-13 09:31:14,094][__main__][INFO] - Starting iteration 241.
[2025-11-13 09:31:14,097][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 24 and human policies 1.
[2025-11-13 09:31:14,098][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:31:23,325][__main__][INFO] - Number of regex retries in iteration 241: 0
[2025-11-13 09:31:23,326][__main__][INFO] - agents played in iteration 241 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:31:23,765][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:31:23,799][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:31:23,833][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:31:24,206][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:31:24,207][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:31:24,207][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:31:24,908][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:31:25,204][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:31:25,531][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:31:25,857][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:31:26,185][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:31:26,516][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:31:26,843][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:31:27,169][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:31:27,503][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:31:27,831][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:31:28,162][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:31:28,489][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:31:28,816][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:31:29,148][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:31:29,480][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:31:29,810][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:31:30,142][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:31:30,475][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:31:30,806][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:31:31,133][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:31:31,466][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 09:31:31,793][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:31:32,122][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:31:32,449][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:31:32,777][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:31:33,103][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:31:33,430][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:31:33,762][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:31:34,091][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:31:34,421][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:31:34,747][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:31:35,078][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:31:35,411][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
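The block above logs every fourth mini-batch out of 128 and then reports 3840 accumulated policy-gradient tokens, i.e. an average of 30 action tokens per mini-batch. A minimal sketch of that accumulation-and-logging cadence (the per-mini-batch token count and all names are assumptions inferred from the log, not the actual `trainer_common` code):

```python
LOG_EVERY = 4                 # matches the every-4th-mini-batch cadence in the log
NUM_MINI_BATCHES = 128
TOKENS_PER_MINI_BATCH = 30    # assumption: 128 * 30 = 3840 tokens total

def accumulate(losses):
    """Accumulate per-mini-batch losses into one averaged gradient step,
    logging progress every LOG_EVERY-th mini-batch (illustrative sketch)."""
    logged, total_tokens, total_loss = [], 0, 0.0
    for i, loss in enumerate(losses):
        if i % LOG_EVERY == 0:
            logged.append(f"Processing mini-batch {i} of {len(losses)}")
        total_loss += loss / len(losses)   # scale so the sum equals the mean
        total_tokens += TOKENS_PER_MINI_BATCH
    return total_loss, total_tokens, logged

total_loss, total_tokens, logged = accumulate([1.0] * NUM_MINI_BATCHES)
```

Scaling each mini-batch loss by `1/len(losses)` before summing is the standard gradient-accumulation trick that makes the 128 small backward passes equivalent to one full-batch step.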
[2025-11-13 09:31:36,184][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:31:36,866][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:31:36,868][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:31:36,870][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:31:37,751][__main__][INFO] - Iteration 242 took 23s (39.01% Gen, 57.26% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 14m 54s. Estimated total time: 19h 42m 43s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 25s, 500 more iterations: 3h 17m 7s. [2025-11-13 09:31:37,753][__main__][INFO] - Starting iteration 242. [2025-11-13 09:31:37,755][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 24 and human policies 1. 
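The per-iteration summary above projects remaining time from the average iteration duration ("Time estimates for 10 more iterations: 3m 56s" implies roughly 23.6 s per iteration). How the trainer actually averages is not shown; this is a sketch under the assumption of a simple per-iteration mean, with a formatter matching the log's duration style:

```python
def fmt(seconds):
    """Render a duration the way the iteration summary does, e.g. '3m 56s'."""
    h, rem = divmod(int(seconds), 3600)
    m, s = divmod(rem, 60)
    if h:
        return f"{h}h {m}m {s}s"
    if m:
        return f"{m}m {s}s"
    return f"{s}s"

def eta(avg_iter_seconds, iters_remaining):
    """Project remaining wall-clock time from an average iteration duration."""
    return fmt(avg_iter_seconds * iters_remaining)
```

With a 23.6 s average, `eta(23.6, 10)` reproduces the "3m 56s" figure, and the same formatter renders the 18h-scale remaining-time estimates in the summary line.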
[2025-11-13 09:31:37,756][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:31:46,354][__main__][INFO] - Number of regex retries in iteration 242: 0 [2025-11-13 09:31:46,355][__main__][INFO] - agents played in iteration 242 are Alice_buffer, Alice, Bob, Bob_buffer [2025-11-13 09:31:46,797][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:31:46,830][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:31:46,864][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:31:46,897][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:31:46,898][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:31:46,898][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
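Each iteration reports a regex-retry count ("Number of regex retries in iteration 242: 0"), suggesting model replies are re-sampled until they match an expected pattern. The actual pattern is not visible in the log; the sketch below assumes a hypothetical IPD-style action tag (the run directory mentions `ipd`), purely for illustration:

```python
import re

# Hypothetical pattern: the real expected reply format is not shown in the log.
ACTION_RE = re.compile(r"<action>(COOPERATE|DEFECT)</action>")

def parse_with_retries(generate, max_retries=5):
    """Re-sample until the reply matches the expected pattern, counting
    retries the way the per-iteration log line does (illustrative sketch)."""
    retries = 0
    while True:
        match = ACTION_RE.search(generate())
        if match:
            return match.group(1), retries
        retries += 1
        if retries > max_retries:
            raise ValueError("no parsable action after retries")

replies = iter(["let me think about it...", "<action>COOPERATE</action>"])
action, retries = parse_with_retries(lambda: next(replies))
```

A retry count of 0 across the iterations shown indicates the policies' outputs consistently matched the expected format on the first sample.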
[2025-11-13 09:31:47,629][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:31:47,926][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:31:48,254][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:31:48,580][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:31:48,907][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:31:49,235][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:31:49,562][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:31:49,888][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:31:50,215][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:31:50,542][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:31:50,869][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:31:51,196][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:31:51,525][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:31:51,855][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:31:52,184][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:31:52,512][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:31:52,839][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:31:53,167][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:31:53,500][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:31:53,830][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:31:54,159][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 09:31:54,489][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:31:54,817][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:31:55,143][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:31:55,470][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:31:55,798][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:31:56,125][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:31:56,453][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:31:56,778][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:31:57,107][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:31:57,433][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:31:57,759][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:31:58,087][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 09:31:58,804][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:31:59,524][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:31:59,526][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:31:59,528][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:32:00,422][__main__][INFO] - Iteration 243 took 22s (37.93% Gen, 58.11% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 25m 13s. Estimated total time: 18h 53m 24s. Time estimates for 10 more iterations: 3m 46s, 100 more iterations: 37m 46s, 500 more iterations: 3h 8m 54s. [2025-11-13 09:32:00,424][__main__][INFO] - Starting iteration 243. [2025-11-13 09:32:00,427][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 24 and human policies 1. 
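After each reinforce step the trainer persists optimizer state (`.pt`) and a pickled annealing state (`.pkl`) under the run's `agent_trainer` directory. The optimizer files imply `torch.save`; for the pickle file, a minimal stand-alone sketch of the save path construction and round-trip (the state dict's contents and the helper name are assumptions, not the real `trainer_common` code):

```python
import pathlib
import pickle
import tempfile

def save_trainer_state(state, out_dir):
    """Pickle trainer state to trainer_annealing_state.pkl in out_dir
    (illustrative sketch; the real trainer's state contents are not shown)."""
    out_dir = pathlib.Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / "trainer_annealing_state.pkl"
    with open(path, "wb") as f:
        pickle.dump(state, f)
    return path

# Hypothetical state; the log shows only the destination filename.
tmp_dir = tempfile.mkdtemp()
path = save_trainer_state({"anneal_step": 242}, tmp_dir)
```

Saving every iteration keeps the ~23 s iterations resumable at the cost of a sub-second write, which the timestamps above confirm.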
[2025-11-13 09:32:00,428][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:32:09,867][__main__][INFO] - Number of regex retries in iteration 243: 0 [2025-11-13 09:32:09,868][__main__][INFO] - agents played in iteration 243 are Alice_buffer, Alice, Bob, Bob_buffer [2025-11-13 09:32:10,308][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:32:10,341][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:32:10,374][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:32:10,409][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:32:10,409][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:32:10,410][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:32:11,113][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:32:11,409][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:32:11,737][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:32:12,063][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:32:12,390][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:32:12,716][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:32:13,042][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:32:13,368][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:32:13,694][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:32:14,020][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:32:14,346][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:32:14,672][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:32:14,998][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:32:15,329][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:32:15,657][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:32:15,986][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:32:16,314][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:32:16,642][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:32:16,972][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:32:17,300][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:32:17,626][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 09:32:17,955][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:32:18,283][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:32:18,610][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:32:18,938][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:32:19,265][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:32:19,591][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:32:19,919][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:32:20,246][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:32:20,574][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:32:20,903][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:32:21,231][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:32:21,560][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 09:32:22,277][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:32:22,967][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:32:22,969][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:32:22,970][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:32:23,849][__main__][INFO] - Iteration 244 took 23s (40.30% Gen, 55.94% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 2m 33s. Estimated total time: 19h 31m 7s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 2s, 500 more iterations: 3h 15m 11s. [2025-11-13 09:32:23,851][__main__][INFO] - Starting iteration 244. [2025-11-13 09:32:23,854][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 24 and human policies 1. 
[2025-11-13 09:32:23,854][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:32:33,361][__main__][INFO] - Number of regex retries in iteration 244: 0 [2025-11-13 09:32:33,362][__main__][INFO] - agents played in iteration 244 are Alice_buffer, Alice, Bob, Bob_buffer [2025-11-13 09:32:33,795][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:32:33,831][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:32:33,864][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:32:33,897][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:32:33,897][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:32:33,898][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:32:34,590][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:32:34,887][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:32:35,213][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:32:35,541][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:32:35,869][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:32:36,196][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:32:36,523][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:32:36,849][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:32:37,177][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:32:37,504][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:32:37,831][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:32:38,162][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:32:38,491][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:32:38,817][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:32:39,145][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:32:39,473][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:32:39,801][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:32:40,128][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:32:40,454][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:32:40,782][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:32:41,110][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 09:32:41,442][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:32:41,774][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:32:42,106][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:32:42,432][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:32:42,758][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:32:43,089][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:32:43,418][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:32:43,745][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:32:44,072][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:32:44,399][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:32:44,726][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:32:45,054][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 09:32:45,779][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:32:46,460][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:32:46,462][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:32:46,463][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:32:47,348][__main__][INFO] - Iteration 245 took 23s (40.47% Gen, 55.76% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 5m 46s. Estimated total time: 19h 34m 44s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 9s, 500 more iterations: 3h 15m 47s. [2025-11-13 09:32:47,349][__main__][INFO] - Starting iteration 245. [2025-11-13 09:32:47,352][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 24 and human policies 1. 
[2025-11-13 09:32:47,353][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:32:56,674][__main__][INFO] - Number of regex retries in iteration 245: 0 [2025-11-13 09:32:56,675][__main__][INFO] - agents played in iteration 245 are Alice_buffer, Alice, Bob, Bob_buffer [2025-11-13 09:32:57,112][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:32:57,145][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:32:57,179][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:32:57,212][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:32:57,212][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:32:57,212][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:32:57,909][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:32:58,205][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:32:58,533][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:32:58,860][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:32:59,187][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:32:59,515][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:32:59,841][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:33:00,172][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:33:00,499][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:33:00,826][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:33:01,151][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:33:01,478][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:33:01,807][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:33:02,132][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:33:02,460][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:33:02,790][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:33:03,119][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:33:03,446][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:33:03,772][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:33:04,102][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:33:04,432][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 09:33:04,759][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:33:05,085][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:33:05,411][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:33:05,737][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:33:06,061][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:33:06,389][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:33:06,714][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:33:07,040][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:33:07,367][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:33:07,693][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:33:08,020][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:33:08,348][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 09:33:09,077][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:33:09,753][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:33:09,755][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:33:09,756][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:33:10,651][__main__][INFO] - Iteration 246 took 23s (40.01% Gen, 56.15% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 55m 36s. Estimated total time: 19h 24m 57s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 49s, 500 more iterations: 3h 14m 9s. [2025-11-13 09:33:10,652][__main__][INFO] - Starting iteration 246. [2025-11-13 09:33:10,655][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 24 and human policies 1. 
[2025-11-13 09:33:10,655][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:33:20,378][__main__][INFO] - Number of regex retries in iteration 246: 0 [2025-11-13 09:33:20,379][__main__][INFO] - agents played in iteration 246 are Alice_buffer, Alice, Bob, Bob_buffer [2025-11-13 09:33:20,813][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:33:20,846][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:33:20,879][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:33:20,912][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:33:20,913][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:33:20,913][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:33:21,618][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:33:21,914][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:33:22,242][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:33:22,568][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:33:22,899][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:33:23,230][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:33:23,562][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:33:23,891][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:33:24,221][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:33:24,550][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:33:24,876][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:33:25,202][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:33:25,527][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:33:25,858][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:33:26,186][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:33:26,512][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:33:26,840][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:33:27,166][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:33:27,492][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:33:27,819][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:33:28,147][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128
[2025-11-13 09:33:28,473][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:33:28,802][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:33:29,130][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:33:29,462][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:33:29,791][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:33:30,120][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:33:30,446][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:33:30,773][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:33:31,100][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:33:31,427][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:33:31,755][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:33:32,081][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:33:32,799][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:33:33,479][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:33:33,481][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:33:33,482][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:33:34,359][__main__][INFO] - Iteration 247 took 23s (41.02% Gen, 55.27% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 15m 30s. Estimated total time: 19h 45m 15s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 30s, 500 more iterations: 3h 17m 32s.
[2025-11-13 09:33:34,361][__main__][INFO] - Starting iteration 247.
[2025-11-13 09:33:34,364][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 24 and human policies 1.
[2025-11-13 09:33:34,364][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:33:43,811][__main__][INFO] - Number of regex retries in iteration 247: 0
[2025-11-13 09:33:43,811][__main__][INFO] - agents played in iteration 247 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:33:44,236][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:33:44,273][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:33:44,307][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:33:44,340][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:33:44,340][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:33:44,341][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:33:45,028][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:33:45,325][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:33:45,651][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:33:45,980][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:33:46,308][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:33:46,633][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:33:46,960][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:33:47,288][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:33:47,615][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:33:47,941][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:33:48,269][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:33:48,594][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:33:48,922][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:33:49,249][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:33:49,577][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:33:49,907][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:33:50,235][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:33:50,562][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:33:50,889][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:33:51,214][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:33:51,540][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:33:51,867][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:33:52,196][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:33:52,523][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:33:52,851][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:33:53,179][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:33:53,506][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:33:53,832][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:33:54,164][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:33:54,490][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:33:54,816][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:33:55,144][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:33:55,470][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:33:56,183][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:33:56,867][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:33:56,869][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:33:56,871][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:33:57,761][__main__][INFO] - Iteration 248 took 23s (40.38% Gen, 55.82% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 59m 46s. Estimated total time: 19h 29m 55s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 59s, 500 more iterations: 3h 14m 59s.
[2025-11-13 09:33:57,763][__main__][INFO] - Starting iteration 248.
[2025-11-13 09:33:57,766][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 24 and human policies 1.
[2025-11-13 09:33:57,767][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:34:06,870][__main__][INFO] - Number of regex retries in iteration 248: 0
[2025-11-13 09:34:06,870][__main__][INFO] - agents played in iteration 248 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:34:07,309][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:34:07,345][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:34:07,379][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:34:07,412][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:34:07,413][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:34:07,413][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:34:08,108][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:34:08,405][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:34:08,730][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:34:09,060][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:34:09,391][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:34:09,716][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:34:10,042][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:34:10,368][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:34:10,693][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:34:11,021][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:34:11,350][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:34:11,677][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:34:12,003][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:34:12,329][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:34:12,657][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:34:12,986][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:34:13,312][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:34:13,640][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:34:13,971][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:34:14,300][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:34:14,632][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:34:14,958][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:34:15,286][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:34:15,612][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:34:15,940][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:34:16,272][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:34:16,601][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:34:16,928][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:34:17,257][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:34:17,585][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:34:17,911][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:34:18,238][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:34:18,565][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:34:19,278][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:34:19,959][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:34:19,961][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:34:19,962][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:34:20,887][__main__][INFO] - Iteration 249 took 23s (39.37% Gen, 56.62% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 45m 33s. Estimated total time: 19h 16m 4s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 32s, 500 more iterations: 3h 12m 40s.
[2025-11-13 09:34:20,889][__main__][INFO] - Starting iteration 249.
[2025-11-13 09:34:20,892][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 24 and human policies 1.
[2025-11-13 09:34:20,892][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:34:29,587][__main__][INFO] - Number of regex retries in iteration 249: 0
[2025-11-13 09:34:29,588][__main__][INFO] - agents played in iteration 249 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:34:30,016][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:34:30,049][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:34:30,081][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:34:30,114][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:34:30,115][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:34:30,116][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:34:30,804][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:34:31,100][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:34:31,428][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:34:31,754][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:34:32,081][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:34:32,407][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:34:32,733][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:34:33,061][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:34:33,387][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:34:33,715][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:34:34,043][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:34:34,374][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:34:34,702][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:34:35,033][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:34:35,360][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:34:35,687][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:34:36,016][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:34:36,342][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:34:36,669][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:34:36,995][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:34:37,322][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:34:37,650][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:34:37,977][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:34:38,304][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:34:38,631][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:34:38,962][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:34:39,294][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:34:39,623][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:34:39,955][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:34:40,282][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:34:40,609][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:34:40,941][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:34:41,270][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:34:41,987][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:34:42,683][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:34:42,685][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:34:42,686][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:34:43,574][__main__][INFO] - Iteration 250 took 22s (38.33% Gen, 57.74% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 23m 16s. Estimated total time: 18h 54m 10s. Time estimates for 10 more iterations: 3m 46s, 100 more iterations: 37m 48s, 500 more iterations: 3h 9m 1s.
[2025-11-13 09:34:43,576][__main__][INFO] - Starting iteration 250.
[2025-11-13 09:34:43,579][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 24 and human policies 1.
[2025-11-13 09:34:43,579][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:34:52,470][__main__][INFO] - Number of regex retries in iteration 250: 0
[2025-11-13 09:34:52,470][__main__][INFO] - agents played in iteration 250 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:34:52,905][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:34:52,938][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:34:52,971][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:34:53,004][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:34:53,005][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:34:53,005][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:34:53,699][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:34:53,994][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:34:54,321][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:34:54,648][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:34:54,973][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:34:55,300][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:34:55,627][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:34:55,953][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:34:56,280][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:34:56,608][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:34:56,934][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:34:57,261][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:34:57,588][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:34:57,915][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:34:58,241][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:34:58,566][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:34:58,892][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:34:59,217][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:34:59,543][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:34:59,869][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:35:00,195][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:35:00,522][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:35:00,847][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:35:01,176][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:35:01,504][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:35:01,830][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:35:02,159][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:35:02,488][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:35:02,818][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:35:03,144][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:35:03,470][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:35:03,798][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:35:04,130][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:35:04,858][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:35:05,549][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:35:05,551][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:35:05,553][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:35:07,387][__main__][INFO] - Iteration 251 took 23s (37.34% Gen, 54.95% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 19m 9s. Estimated total time: 19h 50m 27s. Time estimates for 10 more iterations: 3m 58s, 100 more iterations: 39m 40s, 500 more iterations: 3h 18m 24s.
[2025-11-13 09:35:07,389][__main__][INFO] - Starting iteration 251.
[2025-11-13 09:35:07,392][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 25 and human policies 1.
[2025-11-13 09:35:07,392][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:35:17,328][__main__][INFO] - Number of regex retries in iteration 251: 0
[2025-11-13 09:35:17,329][__main__][INFO] - agents played in iteration 251 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:35:17,757][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:35:17,794][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:35:17,828][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:35:18,178][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:35:18,178][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:35:18,178][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:35:18,883][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:35:19,180][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:35:19,508][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:35:19,836][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:35:20,162][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:35:20,488][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:35:20,815][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:35:21,141][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:35:21,468][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:35:21,794][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:35:22,120][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:35:22,449][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:35:22,775][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:35:23,103][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:35:23,428][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:35:23,754][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:35:24,081][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:35:24,407][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:35:24,733][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:35:25,059][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:35:25,384][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:35:25,709][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:35:26,037][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:35:26,363][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:35:26,690][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:35:27,016][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:35:27,345][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:35:27,672][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:35:27,999][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:35:28,326][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:35:28,654][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:35:28,981][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:35:29,310][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:35:30,030][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:35:30,709][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:35:30,711][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:35:30,712][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:35:31,886][__main__][INFO] - Iteration 252 took 24s (40.57% Gen, 54.64% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 53m 2s. Estimated total time: 20h 24m 45s. Time estimates for 10 more iterations: 4m 4s, 100 more iterations: 40m 49s, 500 more iterations: 3h 24m 7s.
[2025-11-13 09:35:31,888][__main__][INFO] - Starting iteration 252.
[2025-11-13 09:35:31,891][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 25 and human policies 1.
[2025-11-13 09:35:31,891][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:35:41,752][__main__][INFO] - Number of regex retries in iteration 252: 0
[2025-11-13 09:35:41,753][__main__][INFO] - agents played in iteration 252 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:35:42,178][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:35:42,215][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:35:42,249][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:35:42,283][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:35:42,284][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:35:42,284][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:35:43,014][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:35:43,311][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:35:43,639][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:35:43,971][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:35:44,302][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:35:44,634][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:35:44,963][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:35:45,294][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:35:45,623][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:35:45,949][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:35:46,275][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:35:46,602][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:35:46,927][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:35:47,252][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:35:47,581][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:35:47,908][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:35:48,234][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:35:48,560][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:35:48,886][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:35:49,212][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:35:49,538][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:35:49,867][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:35:50,193][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:35:50,519][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:35:50,846][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:35:51,176][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:35:51,505][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:35:51,834][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:35:52,166][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:35:52,493][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:35:52,820][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:35:53,146][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:35:53,473][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:35:54,187][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:35:54,877][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:35:54,878][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:35:54,880][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:35:55,842][__main__][INFO] - Iteration 253 took 23s (41.17% Gen, 54.81% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 25m 28s. Estimated total time: 19h 57m 35s. Time estimates for 10 more iterations: 3m 59s, 100 more iterations: 39m 55s, 500 more iterations: 3h 19m 35s.
[2025-11-13 09:35:55,843][__main__][INFO] - Starting iteration 253.
[2025-11-13 09:35:55,846][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 25 and human policies 1.
[2025-11-13 09:35:55,847][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:36:05,533][__main__][INFO] - Number of regex retries in iteration 253: 0
[2025-11-13 09:36:05,534][__main__][INFO] - agents played in iteration 253 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:36:05,972][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:36:06,007][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:36:06,041][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:36:06,075][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:36:06,076][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:36:06,076][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:36:06,784][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:36:07,081][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:36:07,408][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:36:07,734][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:36:08,061][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:36:08,388][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:36:08,713][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:36:09,039][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:36:09,365][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:36:09,691][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:36:10,017][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:36:10,342][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:36:10,668][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:36:10,994][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:36:11,321][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:36:11,647][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:36:11,974][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:36:12,303][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:36:12,631][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:36:12,958][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:36:13,287][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:36:13,613][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:36:13,939][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:36:14,267][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:36:14,593][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:36:14,923][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:36:15,253][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:36:15,580][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:36:15,910][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:36:16,236][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:36:16,563][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:36:16,889][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:36:17,216][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:36:17,936][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:36:18,635][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:36:18,637][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:36:18,638][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:36:19,535][__main__][INFO] - Iteration 254 took 23s (40.89% Gen, 55.32% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 11m 57s. Estimated total time: 19h 44m 27s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 28s, 500 more iterations: 3h 17m 24s.
[2025-11-13 09:36:19,537][__main__][INFO] - Starting iteration 254.
[2025-11-13 09:36:19,541][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 25 and human policies 1.
[2025-11-13 09:36:19,541][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:36:28,439][__main__][INFO] - Number of regex retries in iteration 254: 0
[2025-11-13 09:36:28,440][__main__][INFO] - agents played in iteration 254 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:36:28,939][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:36:28,972][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:36:29,005][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:36:29,038][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:36:29,038][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:36:29,039][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:36:29,751][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:36:30,047][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:36:30,376][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:36:30,704][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:36:31,029][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:36:31,358][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:36:31,685][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:36:32,010][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:36:32,337][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:36:32,662][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:36:32,988][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:36:33,313][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:36:33,638][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:36:33,964][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:36:34,290][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:36:34,616][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:36:34,941][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:36:35,269][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:36:35,596][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:36:35,923][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:36:36,250][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:36:36,580][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:36:36,909][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:36:37,235][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:36:37,563][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:36:37,889][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:36:38,215][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:36:38,543][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:36:38,870][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:36:39,198][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:36:39,529][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:36:39,855][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:36:40,183][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:36:40,896][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:36:41,605][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:36:41,607][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:36:41,609][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:36:42,489][__main__][INFO] - Iteration 255 took 22s (38.77% Gen, 57.38% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 34m 33s. Estimated total time: 19h 7m 26s. Time estimates for 10 more iterations: 3m 49s, 100 more iterations: 38m 14s, 500 more iterations: 3h 11m 14s.
[2025-11-13 09:36:42,491][__main__][INFO] - Starting iteration 255.
[2025-11-13 09:36:42,494][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 25 and human policies 1.
[2025-11-13 09:36:42,494][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:36:51,576][__main__][INFO] - Number of regex retries in iteration 255: 0
[2025-11-13 09:36:51,577][__main__][INFO] - agents played in iteration 255 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:36:52,001][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:36:52,364][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:36:52,397][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:36:52,430][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:36:52,431][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:36:52,431][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:36:53,153][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:36:53,450][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:36:53,780][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:36:54,107][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:36:54,434][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:36:54,760][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:36:55,089][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:36:55,420][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:36:55,746][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:36:56,072][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:36:56,399][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:36:56,725][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:36:57,052][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:36:57,380][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:36:57,710][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:36:58,041][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:36:58,369][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:36:58,695][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:36:59,026][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:36:59,352][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:36:59,678][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:37:00,009][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:37:00,342][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:37:00,670][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:37:00,998][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:37:01,325][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:37:01,654][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:37:01,980][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:37:02,307][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:37:02,634][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:37:02,962][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:37:03,289][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:37:03,619][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:37:04,335][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:37:05,035][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:37:05,037][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:37:05,038][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:37:05,944][__main__][INFO] - Iteration 256 took 23s (38.73% Gen, 57.40% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 59m 16s. Estimated total time: 19h 32m 33s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 5s, 500 more iterations: 3h 15m 25s.
[2025-11-13 09:37:05,946][__main__][INFO] - Starting iteration 256.
[2025-11-13 09:37:05,949][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 25 and human policies 1.
[2025-11-13 09:37:05,949][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:37:15,243][__main__][INFO] - Number of regex retries in iteration 256: 0
[2025-11-13 09:37:15,243][__main__][INFO] - agents played in iteration 256 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:37:15,677][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:37:15,710][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:37:15,744][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:37:15,777][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:37:15,778][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:37:15,779][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:37:16,479][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:37:16,777][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:37:17,107][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:37:17,435][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:37:17,763][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:37:18,090][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:37:18,416][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:37:18,742][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:37:19,070][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:37:19,400][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:37:19,732][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:37:20,064][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:37:20,391][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:37:20,721][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:37:21,048][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:37:21,374][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:37:21,705][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:37:22,034][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:37:22,362][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:37:22,690][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:37:23,019][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:37:23,351][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:37:23,678][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:37:24,005][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:37:24,334][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:37:24,663][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:37:24,992][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:37:25,319][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:37:25,645][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:37:25,972][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:37:26,299][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:37:26,625][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:37:26,951][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:37:27,691][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:37:28,398][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:37:28,400][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:37:28,401][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:37:29,296][__main__][INFO] - Iteration 257 took 23s (39.81% Gen, 56.36% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 53m 44s. Estimated total time: 19h 27m 24s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 54s, 500 more iterations: 3h 14m 34s.
[2025-11-13 09:37:29,298][__main__][INFO] - Starting iteration 257.
[2025-11-13 09:37:29,301][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 25 and human policies 1.
[2025-11-13 09:37:29,302][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:37:38,914][__main__][INFO] - Number of regex retries in iteration 257: 0
[2025-11-13 09:37:38,915][__main__][INFO] - agents played in iteration 257 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:37:39,361][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:37:39,400][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:37:39,433][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:37:39,467][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:37:39,467][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:37:39,468][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:37:40,162][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:37:40,458][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:37:40,786][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:37:41,114][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:37:41,442][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:37:41,768][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:37:42,095][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:37:42,420][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:37:42,747][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:37:43,073][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:37:43,401][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:37:43,728][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:37:44,056][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:37:44,384][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:37:44,712][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:37:45,044][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:37:45,373][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:37:45,702][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:37:46,027][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:37:46,356][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:37:46,682][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:37:47,009][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:37:47,335][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:37:47,663][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:37:47,990][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:37:48,316][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:37:48,642][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:37:48,968][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:37:49,295][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:37:49,622][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:37:49,949][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:37:50,276][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:37:50,603][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:37:51,351][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:37:52,060][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:37:52,062][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:37:52,063][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:37:52,976][__main__][INFO] - Iteration 258 took 23s (40.60% Gen, 55.54% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 9m 43s. Estimated total time: 19h 43m 47s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 27s, 500 more iterations: 3h 17m 17s.
[2025-11-13 09:37:52,982][__main__][INFO] - Starting iteration 258.
[2025-11-13 09:37:52,985][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 25 and human policies 1.
[2025-11-13 09:37:52,985][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:38:02,258][__main__][INFO] - Number of regex retries in iteration 258: 0
[2025-11-13 09:38:02,259][__main__][INFO] - agents played in iteration 258 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:38:02,685][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:38:02,719][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:38:02,753][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:38:02,787][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:38:02,787][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:38:02,788][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:38:03,499][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:38:03,794][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:38:04,120][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:38:04,447][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:38:04,772][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:38:05,099][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:38:05,424][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:38:05,750][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:38:06,078][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:38:06,405][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:38:06,732][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:38:07,061][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:38:07,389][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:38:07,720][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:38:08,047][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:38:08,374][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:38:08,701][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:38:09,033][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:38:09,360][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:38:09,687][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:38:10,013][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:38:10,339][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:38:10,667][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:38:10,994][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:38:11,321][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:38:11,648][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:38:11,976][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:38:12,301][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:38:12,630][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:38:12,957][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:38:13,284][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:38:13,612][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:38:13,940][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:38:14,690][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:38:15,397][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:38:15,398][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:38:15,400][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:38:16,318][__main__][INFO] - Iteration 259 took 23s (39.74% Gen, 56.32% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 52m 15s. Estimated total time: 19h 26m 42s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 53s, 500 more iterations: 3h 14m 27s.
[2025-11-13 09:38:16,320][__main__][INFO] - Starting iteration 259.
[2025-11-13 09:38:16,323][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 25 and human policies 1.
[2025-11-13 09:38:16,324][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:38:25,379][__main__][INFO] - Number of regex retries in iteration 259: 0
[2025-11-13 09:38:25,379][__main__][INFO] - agents played in iteration 259 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:38:25,815][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:38:25,852][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:38:25,885][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:38:25,919][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:38:25,919][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:38:25,919][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:38:26,623][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:38:26,920][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:38:27,250][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:38:27,576][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:38:27,902][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:38:28,230][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:38:28,555][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:38:28,883][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:38:29,212][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:38:29,541][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:38:29,866][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:38:30,193][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:38:30,518][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:38:30,845][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:38:31,174][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:38:31,503][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:38:31,829][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:38:32,157][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:38:32,485][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:38:32,811][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:38:33,138][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:38:33,466][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:38:33,796][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:38:34,123][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:38:34,452][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:38:34,778][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:38:35,105][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:38:35,432][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:38:35,759][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:38:36,086][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:38:36,413][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:38:36,740][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:38:37,066][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:38:37,798][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:38:38,497][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:38:38,499][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:38:38,500][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:38:39,577][__main__][INFO] - Iteration 260 took 23s (38.94% Gen, 56.42% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 47m 55s. Estimated total time: 19h 22m 46s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 45s, 500 more iterations: 3h 13m 47s.
[2025-11-13 09:38:39,580][__main__][INFO] - Starting iteration 260.
[2025-11-13 09:38:39,583][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 25 and human policies 1.
[2025-11-13 09:38:39,583][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:38:48,791][__main__][INFO] - Number of regex retries in iteration 260: 0
[2025-11-13 09:38:48,792][__main__][INFO] - agents played in iteration 260 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:38:49,224][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:38:49,258][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:38:49,293][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:38:49,327][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:38:49,327][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:38:49,328][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:38:50,045][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:38:50,343][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:38:50,671][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:38:50,999][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:38:51,330][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:38:51,656][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:38:51,983][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:38:52,309][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:38:52,638][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:38:52,969][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:38:53,296][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:38:53,627][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:38:53,956][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:38:54,283][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:38:54,610][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:38:54,939][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:38:55,268][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:38:55,595][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:38:55,921][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:38:56,248][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:38:56,574][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:38:56,901][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:38:57,228][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:38:57,554][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:38:57,882][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:38:58,212][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:38:58,539][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:38:58,866][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:38:59,194][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:38:59,523][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:38:59,850][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:39:00,178][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:39:00,506][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:39:01,227][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:39:01,929][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:39:01,931][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:39:01,933][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:39:03,844][__main__][INFO] - Iteration 261 took 24s (37.96% Gen, 54.16% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 37m 53s. Estimated total time: 20h 13m 7s. Time estimates for 10 more iterations: 4m 2s, 100 more iterations: 40m 26s, 500 more iterations: 3h 22m 11s.
[2025-11-13 09:39:03,846][__main__][INFO] - Starting iteration 261.
[2025-11-13 09:39:03,850][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 26 and human policies 1.
[2025-11-13 09:39:03,850][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:39:12,925][__main__][INFO] - Number of regex retries in iteration 261: 0
[2025-11-13 09:39:12,926][__main__][INFO] - agents played in iteration 261 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:39:13,365][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:39:13,403][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:39:13,437][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:39:13,471][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:39:13,471][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:39:13,472][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:39:14,184][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:39:14,481][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:39:14,809][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:39:15,140][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:39:15,471][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:39:15,800][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:39:16,128][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:39:16,459][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:39:16,789][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:39:17,121][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:39:17,452][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:39:17,777][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:39:18,106][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:39:18,435][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:39:18,762][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:39:19,090][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:39:19,418][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:39:19,745][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:39:20,072][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:39:20,399][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:39:20,725][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:39:21,054][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:39:21,381][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:39:21,707][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:39:22,034][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:39:22,360][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:39:22,687][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:39:23,013][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:39:23,341][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:39:23,672][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:39:24,000][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:39:24,327][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:39:24,652][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:39:25,380][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:39:26,087][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:39:26,088][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:39:26,090][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:39:27,041][__main__][INFO] - Iteration 262 took 23s (39.13% Gen, 56.76% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 43m 57s. Estimated total time: 19h 19m 35s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 39s, 500 more iterations: 3h 13m 15s.
[2025-11-13 09:39:27,043][__main__][INFO] - Starting iteration 262.
[2025-11-13 09:39:27,046][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 26 and human policies 1.
[2025-11-13 09:39:27,047][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:39:36,212][__main__][INFO] - Number of regex retries in iteration 262: 0
[2025-11-13 09:39:36,213][__main__][INFO] - agents played in iteration 262 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:39:36,645][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:39:36,679][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:39:36,712][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:39:36,745][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:39:36,746][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:39:36,746][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:39:37,460][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:39:37,757][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:39:38,083][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:39:38,411][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:39:38,738][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:39:39,066][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:39:39,393][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:39:39,719][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:39:40,048][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:39:40,375][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:39:40,702][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:39:41,033][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:39:41,362][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:39:41,689][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:39:42,016][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:39:42,343][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:39:42,669][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:39:42,996][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:39:43,325][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:39:43,652][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:39:43,977][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:39:44,304][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:39:44,632][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:39:44,959][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:39:45,288][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:39:45,615][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:39:45,941][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:39:46,267][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:39:46,593][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:39:46,919][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:39:47,248][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:39:47,575][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:39:47,903][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:39:48,626][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:39:49,335][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:39:49,336][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:39:49,339][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:39:50,549][__main__][INFO] - Iteration 263 took 23s (39.00% Gen, 55.85% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 59m 9s. Estimated total time: 19h 35m 10s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 10s, 500 more iterations: 3h 15m 51s.
[2025-11-13 09:39:50,551][__main__][INFO] - Starting iteration 263.
[2025-11-13 09:39:50,554][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 26 and human policies 1.
[2025-11-13 09:39:50,554][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:39:59,779][__main__][INFO] - Number of regex retries in iteration 263: 0
[2025-11-13 09:39:59,780][__main__][INFO] - agents played in iteration 263 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:40:00,212][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:40:00,249][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:40:00,283][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:40:00,317][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:40:00,317][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:40:00,318][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:40:01,026][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:40:01,323][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:40:01,651][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:40:01,978][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:40:02,305][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:40:02,635][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:40:02,963][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:40:03,290][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:40:03,617][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:40:03,943][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:40:04,268][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:40:04,595][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:40:04,924][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:40:05,251][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:40:05,577][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:40:05,905][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:40:06,231][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:40:06,559][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:40:06,885][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:40:07,212][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:40:07,539][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:40:07,867][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:40:08,194][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:40:08,520][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:40:08,846][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:40:09,173][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:40:09,500][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:40:09,827][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:40:10,153][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:40:10,479][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:40:10,806][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:40:11,134][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:40:11,460][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:40:12,187][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:40:12,889][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:40:12,891][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:40:12,892][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:40:13,821][__main__][INFO] - Iteration 264 took 23s (39.65% Gen, 56.36% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 46m 58s. Estimated total time: 19h 23m 23s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 46s, 500 more iterations: 3h 13m 53s.
[2025-11-13 09:40:13,823][__main__][INFO] - Starting iteration 264.
[2025-11-13 09:40:13,825][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 26 and human policies 1.
[2025-11-13 09:40:13,826][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:40:22,340][__main__][INFO] - Number of regex retries in iteration 264: 0
[2025-11-13 09:40:22,341][__main__][INFO] - agents played in iteration 264 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:40:22,780][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:40:22,814][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:40:22,847][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:40:22,880][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:40:22,881][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:40:22,882][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:40:23,602][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:40:23,900][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:40:24,227][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:40:24,555][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:40:24,888][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:40:25,220][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:40:25,553][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:40:25,883][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:40:26,213][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:40:26,544][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:40:26,871][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:40:27,198][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:40:27,527][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:40:27,854][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:40:28,182][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:40:28,508][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:40:28,834][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:40:29,162][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:40:29,489][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:40:29,816][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:40:30,144][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:40:30,471][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:40:30,799][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:40:31,127][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:40:31,457][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:40:31,787][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:40:32,113][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:40:32,440][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:40:32,768][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:40:33,095][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:40:33,422][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:40:33,749][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:40:34,076][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:40:34,799][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:40:35,517][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:40:35,519][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:40:35,521][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:40:36,716][__main__][INFO] - Iteration 265 took 22s (37.20% Gen, 57.58% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 27m 45s. Estimated total time: 19h 4m 33s. Time estimates for 10 more iterations: 3m 48s, 100 more iterations: 38m 9s, 500 more iterations: 3h 10m 45s.
[2025-11-13 09:40:36,718][__main__][INFO] - Starting iteration 265.
[2025-11-13 09:40:36,721][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 26 and human policies 1.
[2025-11-13 09:40:36,722][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:40:45,867][__main__][INFO] - Number of regex retries in iteration 265: 0
[2025-11-13 09:40:45,868][__main__][INFO] - agents played in iteration 265 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:40:46,303][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:40:46,337][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:40:46,371][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:40:46,405][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:40:46,406][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:40:46,406][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:40:47,126][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:40:47,423][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:40:47,748][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:40:48,081][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:40:48,408][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:40:48,736][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:40:49,063][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:40:49,390][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:40:49,716][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:40:50,043][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:40:50,371][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:40:50,698][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:40:51,024][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:40:51,350][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:40:51,678][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:40:52,005][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:40:52,334][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:40:52,660][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:40:52,988][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:40:53,313][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:40:53,640][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:40:53,969][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:40:54,301][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:40:54,630][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:40:54,956][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:40:55,283][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:40:55,608][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:40:55,936][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:40:56,265][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:40:56,593][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:40:56,920][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:40:57,248][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:40:57,577][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:40:58,305][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:40:59,012][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:40:59,013][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:40:59,015][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:40:59,998][__main__][INFO] - Iteration 266 took 23s (39.29% Gen, 56.48% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 46m 42s. Estimated total time: 19h 23m 52s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 47s, 500 more iterations: 3h 13m 58s.
[2025-11-13 09:41:00,000][__main__][INFO] - Starting iteration 266.
[2025-11-13 09:41:00,003][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 26 and human policies 1.
[2025-11-13 09:41:00,003][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:41:08,845][__main__][INFO] - Number of regex retries in iteration 266: 0
[2025-11-13 09:41:08,846][__main__][INFO] - agents played in iteration 266 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:41:09,278][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:41:09,312][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:41:09,345][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:41:09,379][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:41:09,380][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:41:09,380][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:41:10,103][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:41:10,399][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:41:10,726][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:41:11,054][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:41:11,382][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:41:11,708][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:41:12,039][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:41:12,367][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:41:12,699][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:41:13,024][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:41:13,352][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:41:13,680][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:41:14,007][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:41:14,333][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:41:14,661][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:41:14,991][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:41:15,319][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:41:15,646][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:41:15,974][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:41:16,302][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:41:16,628][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:41:16,954][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:41:17,282][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:41:17,611][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:41:17,937][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:41:18,265][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:41:18,592][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:41:18,919][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:41:19,247][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:41:19,573][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:41:19,901][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:41:20,228][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:41:20,555][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:41:21,278][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:41:21,982][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:41:21,983][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:41:21,985][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:41:22,938][__main__][INFO] - Iteration 267 took 22s (38.55% Gen, 57.28% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 29m 13s. Estimated total time: 19h 6m 47s. Time estimates for 10 more iterations: 3m 49s, 100 more iterations: 38m 13s, 500 more iterations: 3h 11m 7s.
[2025-11-13 09:41:22,940][__main__][INFO] - Starting iteration 267.
[2025-11-13 09:41:22,944][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 26 and human policies 1.
[2025-11-13 09:41:22,945][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:41:32,017][__main__][INFO] - Number of regex retries in iteration 267: 0
[2025-11-13 09:41:32,017][__main__][INFO] - agents played in iteration 267 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:41:32,465][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:41:32,499][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:41:32,533][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:41:32,567][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:41:32,568][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:41:32,568][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:41:33,301][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:41:33,598][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:41:33,924][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:41:34,256][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:41:34,586][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:41:34,912][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:41:35,242][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:41:35,570][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:41:35,896][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:41:36,223][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:41:36,549][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:41:36,875][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:41:37,203][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:41:37,530][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:41:37,858][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:41:38,186][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:41:38,517][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:41:38,848][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:41:39,176][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:41:39,508][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:41:39,840][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:41:40,169][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:41:40,495][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:41:40,821][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:41:41,148][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:41:41,474][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:41:41,800][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:41:42,127][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:41:42,455][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:41:42,783][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:41:43,112][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:41:43,439][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:41:43,766][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:41:44,493][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:41:45,200][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:41:45,201][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:41:45,203][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:41:46,162][__main__][INFO] - Iteration 268 took 23s (39.07% Gen, 56.79% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 42m 59s. Estimated total time: 19h 20m 56s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 41s, 500 more iterations: 3h 13m 29s.
[2025-11-13 09:41:46,164][__main__][INFO] - Starting iteration 268.
[2025-11-13 09:41:46,167][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 26 and human policies 1.
[2025-11-13 09:41:46,167][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:41:55,206][__main__][INFO] - Number of regex retries in iteration 268: 0
[2025-11-13 09:41:55,207][__main__][INFO] - agents played in iteration 268 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:41:55,646][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:41:55,682][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:41:55,716][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:41:55,750][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:41:55,751][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:41:55,751][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:41:56,453][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:41:56,749][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:41:57,076][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:41:57,405][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:41:57,737][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:41:58,064][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:41:58,391][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:41:58,718][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:41:59,046][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:41:59,373][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:41:59,699][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:42:00,027][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:42:00,352][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:42:00,678][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:42:01,004][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:42:01,331][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:42:01,659][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:42:01,986][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:42:02,312][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:42:02,642][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:42:02,971][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:42:03,298][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:42:03,633][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:42:03,961][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:42:04,288][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:42:04,615][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:42:04,943][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:42:05,269][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:42:05,597][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:42:05,923][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:42:06,252][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:42:06,584][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:42:06,911][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:42:07,633][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:42:08,349][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:42:08,354][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:42:08,356][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:42:09,298][__main__][INFO] - Iteration 269 took 23s (39.07% Gen, 56.85% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 38m 18s. Estimated total time: 19h 16m 38s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 33s, 500 more iterations: 3h 12m 46s.
[2025-11-13 09:42:09,301][__main__][INFO] - Starting iteration 269.
[2025-11-13 09:42:09,303][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 26 and human policies 1.
[2025-11-13 09:42:09,304][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:42:17,892][__main__][INFO] - Number of regex retries in iteration 269: 0 [2025-11-13 09:42:17,893][__main__][INFO] - agents played in iteration 269 are Alice_buffer, Alice, Bob, Bob_buffer [2025-11-13 09:42:18,348][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:42:18,381][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:42:18,414][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:42:18,447][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:42:18,448][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:42:18,449][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:42:19,173][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:42:19,471][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:42:19,799][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:42:20,127][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:42:20,459][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:42:20,788][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:42:21,119][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:42:21,445][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:42:21,772][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:42:22,099][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:42:22,426][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:42:22,758][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:42:23,085][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:42:23,414][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:42:23,741][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:42:24,068][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:42:24,396][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:42:24,721][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:42:25,048][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:42:25,375][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:42:25,703][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:42:26,030][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:42:26,358][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:42:26,684][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:42:27,010][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:42:27,337][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:42:27,664][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:42:27,990][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:42:28,317][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:42:28,644][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:42:28,971][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:42:29,298][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:42:29,625][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:42:30,354][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:42:31,085][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:42:31,086][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:42:31,088][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:42:32,075][__main__][INFO] - Iteration 270 took 22s (37.71% Gen, 57.94% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 19m 56s. Estimated total time: 18h 58m 39s. Time estimates for 10 more iterations: 3m 47s, 100 more iterations: 37m 57s, 500 more iterations: 3h 9m 46s.
[2025-11-13 09:42:32,077][__main__][INFO] - Starting iteration 270.
[2025-11-13 09:42:32,081][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 26 and human policies 1.
[2025-11-13 09:42:32,081][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:42:41,123][__main__][INFO] - Number of regex retries in iteration 270: 0
[2025-11-13 09:42:41,123][__main__][INFO] - agents played in iteration 270 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:42:41,548][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:42:41,582][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:42:41,615][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:42:41,648][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:42:41,649][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:42:41,649][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:42:42,382][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:42:42,680][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:42:43,007][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:42:43,339][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:42:43,669][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:42:43,994][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:42:44,322][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:42:44,648][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:42:44,975][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:42:45,302][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:42:45,629][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:42:45,955][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:42:46,282][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:42:46,610][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:42:46,936][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:42:47,263][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:42:47,590][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:42:47,918][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:42:48,246][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:42:48,576][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:42:48,903][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:42:49,229][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:42:49,557][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:42:49,885][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:42:50,212][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:42:50,539][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:42:50,865][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:42:51,192][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:42:51,520][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:42:51,845][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:42:52,172][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:42:52,499][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:42:52,828][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:42:53,559][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:42:54,284][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:42:54,289][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:42:54,291][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:42:56,408][__main__][INFO] - Iteration 271 took 24s (37.17% Gen, 54.12% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 37m 17s. Estimated total time: 20h 16m 25s. Time estimates for 10 more iterations: 4m 3s, 100 more iterations: 40m 32s, 500 more iterations: 3h 22m 44s.
[2025-11-13 09:42:56,410][__main__][INFO] - Starting iteration 271.
[2025-11-13 09:42:56,414][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 27 and human policies 1.
[2025-11-13 09:42:56,415][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:43:06,162][__main__][INFO] - Number of regex retries in iteration 271: 0
[2025-11-13 09:43:06,162][__main__][INFO] - agents played in iteration 271 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:43:06,604][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:43:06,641][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:43:06,675][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:43:06,709][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:43:06,709][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:43:06,710][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:43:07,431][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:43:07,730][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:43:08,058][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:43:08,384][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:43:08,709][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:43:09,036][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:43:09,365][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:43:09,692][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:43:10,019][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:43:10,345][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:43:10,674][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:43:11,001][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:43:11,328][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:43:11,654][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:43:11,980][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:43:12,306][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:43:12,633][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:43:12,960][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:43:13,286][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:43:13,613][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:43:13,942][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:43:14,271][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:43:14,597][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:43:14,923][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:43:15,252][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:43:15,580][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:43:15,909][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:43:16,237][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:43:16,564][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:43:16,890][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:43:17,217][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:43:17,544][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:43:17,872][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:43:18,593][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:43:19,328][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:43:19,330][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:43:19,331][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:43:20,395][__main__][INFO] - Iteration 272 took 23s (40.64% Gen, 54.91% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 19m 36s. Estimated total time: 19h 59m 7s. Time estimates for 10 more iterations: 3m 59s, 100 more iterations: 39m 58s, 500 more iterations: 3h 19m 51s.
[2025-11-13 09:43:20,397][__main__][INFO] - Starting iteration 272.
[2025-11-13 09:43:20,400][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 27 and human policies 1.
[2025-11-13 09:43:20,401][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:43:29,450][__main__][INFO] - Number of regex retries in iteration 272: 0
[2025-11-13 09:43:29,451][__main__][INFO] - agents played in iteration 272 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:43:29,896][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:43:29,929][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:43:29,963][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:43:29,997][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:43:29,998][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:43:29,999][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:43:30,727][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:43:31,025][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:43:31,351][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:43:31,678][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:43:32,005][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:43:32,332][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:43:32,658][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:43:32,985][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:43:33,312][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:43:33,638][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:43:33,966][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:43:34,292][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:43:34,619][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:43:34,945][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:43:35,273][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:43:35,601][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:43:35,929][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:43:36,255][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:43:36,583][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:43:36,910][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:43:37,237][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:43:37,565][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:43:37,892][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:43:38,220][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:43:38,547][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:43:38,874][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:43:39,201][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:43:39,529][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:43:39,856][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:43:40,181][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:43:40,507][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:43:40,836][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:43:41,161][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:43:41,873][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:43:42,603][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:43:42,605][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:43:42,607][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:43:43,574][__main__][INFO] - Iteration 273 took 23s (39.05% Gen, 56.77% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 38m 50s. Estimated total time: 19h 18m 44s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 37s, 500 more iterations: 3h 13m 7s.
[2025-11-13 09:43:43,576][__main__][INFO] - Starting iteration 273.
[2025-11-13 09:43:43,579][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 27 and human policies 1.
[2025-11-13 09:43:43,580][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:43:52,356][__main__][INFO] - Number of regex retries in iteration 273: 0
[2025-11-13 09:43:52,357][__main__][INFO] - agents played in iteration 273 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:43:52,796][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:43:52,830][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:43:52,865][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:43:52,899][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:43:52,900][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:43:52,900][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:43:53,642][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:43:53,940][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:43:54,267][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:43:54,593][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:43:54,920][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:43:55,246][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:43:55,573][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:43:55,900][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:43:56,227][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:43:56,553][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:43:56,880][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:43:57,206][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:43:57,533][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:43:57,860][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:43:58,187][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:43:58,516][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:43:58,842][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:43:59,169][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:43:59,499][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:43:59,825][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:44:00,151][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:44:00,477][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:44:00,805][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:44:01,132][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:44:01,458][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:44:01,785][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:44:02,112][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:44:02,439][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:44:02,766][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:44:03,091][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:44:03,417][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:44:03,743][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:44:04,070][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:44:04,781][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:44:05,510][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:44:05,511][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:44:05,512][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:44:06,507][__main__][INFO] - Iteration 274 took 22s (38.28% Gen, 57.38% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 26m 6s. Estimated total time: 19h 6m 24s. Time estimates for 10 more iterations: 3m 49s, 100 more iterations: 38m 12s, 500 more iterations: 3h 11m 4s.
[2025-11-13 09:44:06,518][__main__][INFO] - Starting iteration 274.
[2025-11-13 09:44:06,522][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 27 and human policies 1.
[2025-11-13 09:44:06,522][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:44:15,912][__main__][INFO] - Number of regex retries in iteration 274: 0
[2025-11-13 09:44:15,913][__main__][INFO] - agents played in iteration 274 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:44:16,361][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:44:16,395][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:44:16,428][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:44:16,462][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:44:16,463][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:44:16,463][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:44:17,184][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:44:17,481][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:44:17,808][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:44:18,136][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:44:18,463][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:44:18,790][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:44:19,117][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:44:19,445][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:44:19,772][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:44:20,100][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:44:20,426][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:44:20,752][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:44:21,079][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:44:21,406][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:44:21,732][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:44:22,059][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:44:22,385][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:44:22,711][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:44:23,037][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:44:23,363][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:44:23,689][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 09:44:24,015][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:44:24,342][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:44:24,669][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:44:24,996][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:44:25,323][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:44:25,650][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:44:25,978][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:44:26,303][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:44:26,631][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:44:26,959][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:44:27,287][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:44:27,618][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
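The records above show a policy-gradient loss being accumulated over 128 mini-batches before a single optimizer step, with the final tally reported per token ("3840 tokens"). One subtlety this pattern implies: when mini-batches carry unequal token counts, the correct per-token loss is the sum of per-mini-batch summed losses divided by the total token count, not the mean of per-mini-batch means. A minimal sketch of that normalization (all names hypothetical; the actual mllm trainer API is not shown in the log):

```python
def token_weighted_loss(minibatch_sums, minibatch_tokens):
    """Combine accumulated mini-batch losses into one per-token average.

    minibatch_sums   -- summed (not averaged) loss per mini-batch
    minibatch_tokens -- number of loss-bearing tokens per mini-batch
    Hypothetical helper illustrating the normalization implied by
    'Accumulated the policy gradient loss for 3840 tokens.'
    """
    return sum(minibatch_sums) / sum(minibatch_tokens)


def mean_of_means(minibatch_sums, minibatch_tokens):
    """The naive alternative: over-weights token-poor mini-batches."""
    means = [s / t for s, t in zip(minibatch_sums, minibatch_tokens)]
    return sum(means) / len(means)
```

With two mini-batches of 1 and 3 tokens carrying summed losses 2.0 and 12.0, the token-weighted loss is 14/4 = 3.5 while the mean of means is (2.0 + 4.0)/2 = 3.0; the two agree only when every mini-batch has the same token count.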
[2025-11-13 09:44:28,321][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:44:29,052][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:44:29,053][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:44:29,055][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:44:30,014][__main__][INFO] - Iteration 275 took 23s (39.97% Gen, 55.94% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 53m 59s. Estimated total time: 19h 34m 40s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 9s, 500 more iterations: 3h 15m 46s. [2025-11-13 09:44:30,016][__main__][INFO] - Starting iteration 275. [2025-11-13 09:44:30,019][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 27 and human policies 1. 
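Each iteration ends with three checkpoint writes (policy optimizer, critic optimizer, trainer annealing state). Because these files are overwritten every ~23 s, a crash mid-write could corrupt the only copy; a common safeguard is to write to a temp file and rename. This is a hypothetical sketch of that pattern, not the project's actual code; the `.pt` files in the log are presumably written with `torch.save`, and `pickle` stands in here only so the sketch needs nothing beyond the standard library:

```python
import os
import pickle
import tempfile

def save_state_atomic(obj, path):
    """Checkpoint `obj` to `path` without ever leaving a half-written file.

    Hypothetical helper: serialize to a temp file in the same directory,
    then os.replace() it over the target, which is atomic on POSIX, so a
    crash leaves either the old checkpoint or the new one, never a partial.
    """
    directory = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(obj, f)
        os.replace(tmp, path)
    except BaseException:
        os.unlink(tmp)
        raise
```

The same temp-then-rename idea applies unchanged when the serializer is `torch.save` instead of `pickle`.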
[2025-11-13 09:44:30,020][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:44:39,373][__main__][INFO] - Number of regex retries in iteration 275: 0 [2025-11-13 09:44:39,374][__main__][INFO] - agents played in iteration 275 are Alice_buffer, Alice, Bob, Bob_buffer [2025-11-13 09:44:39,830][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:44:39,863][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:44:39,897][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:44:39,930][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:44:39,931][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:44:39,931][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:44:40,650][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:44:40,947][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:44:41,280][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:44:41,609][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:44:41,936][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:44:42,262][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:44:42,593][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:44:42,919][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:44:43,247][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:44:43,574][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:44:43,900][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:44:44,226][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:44:44,554][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:44:44,880][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:44:45,207][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:44:45,534][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:44:45,861][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:44:46,189][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:44:46,515][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:44:46,842][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:44:47,169][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 09:44:47,495][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:44:47,823][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:44:48,150][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:44:48,478][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:44:48,806][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:44:49,131][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:44:49,457][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:44:49,784][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:44:50,111][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:44:50,437][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:44:50,764][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:44:51,092][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 09:44:51,795][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:44:52,529][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:44:52,530][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:44:52,532][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:44:53,484][__main__][INFO] - Iteration 276 took 23s (39.86% Gen, 56.07% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 52m 13s. Estimated total time: 19h 33m 17s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 6s, 500 more iterations: 3h 15m 32s. [2025-11-13 09:44:53,486][__main__][INFO] - Starting iteration 276. [2025-11-13 09:44:53,489][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 27 and human policies 1. 
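The per-iteration timing record extrapolates the remaining wall-clock time from the measured iteration duration. A sketch of that arithmetic (hypothetical function names; the real trainer likely smooths over several iterations rather than using a single 23 s measurement):

```python
def format_duration(seconds):
    """Render a duration as 'Hh Mm Ss', matching the log's estimate style."""
    h, rem = divmod(int(seconds), 3600)
    m, s = divmod(rem, 60)
    return f"{h}h {m}m {s}s"


def eta_report(iter_seconds, iters_done, iters_total):
    """Extrapolate remaining time and a fixed-horizon estimate.

    Hypothetical sketch of the 'Estimated remaining time' arithmetic.
    """
    remaining = (iters_total - iters_done) * iter_seconds
    return {
        "remaining": format_duration(remaining),
        "per_100_iters": format_duration(100 * iter_seconds),
    }
```

At a flat 23 s per iteration, 100 more iterations extrapolate to 38m 20s and 2800 remaining iterations to 17h 53m 20s, the same ballpark as the log's "39m 6s" and "17h 52m 13s", which suggests the trainer's effective per-iteration time is slightly above 23 s (the log rounds the printed duration down).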
[2025-11-13 09:44:53,490][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:45:02,647][__main__][INFO] - Number of regex retries in iteration 276: 0 [2025-11-13 09:45:02,648][__main__][INFO] - agents played in iteration 276 are Alice_buffer, Alice, Bob, Bob_buffer [2025-11-13 09:45:03,092][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:45:03,126][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:45:03,160][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:45:03,194][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:45:03,195][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:45:03,196][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:45:04,264][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:45:04,563][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:45:04,892][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:45:05,218][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:45:05,545][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:45:05,871][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:45:06,197][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:45:06,524][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:45:06,852][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:45:07,180][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:45:07,507][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:45:07,834][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:45:08,161][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:45:08,488][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:45:08,815][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:45:09,141][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:45:09,468][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:45:09,795][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:45:10,122][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:45:10,449][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:45:10,777][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 09:45:11,103][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:45:11,429][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:45:11,757][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:45:12,086][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:45:12,412][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:45:12,738][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:45:13,064][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:45:13,397][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:45:13,728][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:45:14,058][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:45:14,388][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:45:14,717][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 09:45:15,436][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:45:16,168][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:45:16,169][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:45:16,170][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:45:17,133][__main__][INFO] - Iteration 277 took 23s (38.73% Gen, 57.19% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 0m 47s. Estimated total time: 19h 42m 15s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 24s, 500 more iterations: 3h 17m 2s. [2025-11-13 09:45:17,135][__main__][INFO] - Starting iteration 277. [2025-11-13 09:45:17,138][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 27 and human policies 1. 
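The ΔVRAM records normalize raw byte counts against total device memory. A sketch of that formatting, taking byte counts as plain arguments so it stays standard-library only; in a torch program the readings would come from `torch.cuda.memory_allocated()`, `torch.cuda.max_memory_allocated()`, and `torch.cuda.get_device_properties(0).total_memory` (the helper itself is hypothetical):

```python
def vram_report(task, delta_bytes, current_bytes, peak_bytes, total_bytes,
                dt="00:00:00"):
    """Format a VRAM usage line like the trainer's 'For task: ...' records.

    Hypothetical helper: each reading is expressed as a percentage of the
    device's total memory, rounded to two decimals.
    """
    pct = lambda b: 100.0 * b / total_bytes
    return (
        f"For task: {task}, ΔVRAM % (total): {pct(delta_bytes):.2f}%, "
        f"Current % of VRAM taken: {pct(current_bytes):.2f}%, "
        f"Block Peak % of device VRAM: {pct(peak_bytes):.2f}%, ΔTime: {dt}"
    )
```

Feeding it 2.51%, 42.04%, and 25.98% worth of bytes reproduces the reinforce-step line above, so the log's "Block Peak" is the per-task peak (reset between tasks) rather than a lifetime maximum, which is why it can sit below "Current".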
[2025-11-13 09:45:17,139][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:45:26,547][__main__][INFO] - Number of regex retries in iteration 277: 0 [2025-11-13 09:45:26,548][__main__][INFO] - agents played in iteration 277 are Alice_buffer, Alice, Bob, Bob_buffer [2025-11-13 09:45:26,990][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:45:27,024][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:45:27,058][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:45:27,092][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:45:27,092][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:45:27,092][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:45:27,825][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:45:28,123][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:45:28,452][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:45:28,779][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:45:29,106][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:45:29,434][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:45:29,766][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:45:30,093][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:45:30,419][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:45:30,746][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:45:31,072][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:45:31,399][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:45:31,725][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:45:32,052][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:45:32,379][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:45:32,708][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:45:33,037][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:45:33,363][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:45:33,689][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:45:34,017][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:45:34,344][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 09:45:34,669][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:45:34,996][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:45:35,325][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:45:35,651][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:45:35,978][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:45:36,306][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:45:36,635][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:45:36,961][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:45:37,288][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:45:37,615][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:45:37,942][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:45:38,269][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 09:45:38,983][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:45:39,718][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:45:39,719][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:45:39,721][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:45:40,821][__main__][INFO] - Iteration 278 took 23s (39.73% Gen, 55.62% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 2m 21s. Estimated total time: 19h 44m 13s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 28s, 500 more iterations: 3h 17m 22s. [2025-11-13 09:45:40,823][__main__][INFO] - Starting iteration 278. [2025-11-13 09:45:40,826][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 27 and human policies 1. 
[2025-11-13 09:45:40,826][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:45:50,024][__main__][INFO] - Number of regex retries in iteration 278: 0 [2025-11-13 09:45:50,024][__main__][INFO] - agents played in iteration 278 are Alice_buffer, Alice, Bob, Bob_buffer [2025-11-13 09:45:50,467][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:45:50,502][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:45:50,538][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:45:50,573][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:45:50,573][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:45:50,574][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:45:51,303][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:45:51,600][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:45:51,928][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:45:52,256][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:45:52,584][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:45:52,910][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:45:53,239][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:45:53,567][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:45:53,895][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:45:54,222][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:45:54,549][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:45:54,878][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:45:55,205][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:45:55,533][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:45:55,861][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:45:56,191][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:45:56,517][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:45:56,844][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:45:57,170][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:45:57,496][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:45:57,823][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 09:45:58,148][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:45:58,474][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:45:58,806][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:45:59,134][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:45:59,461][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:45:59,787][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:46:00,113][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:46:00,439][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:46:00,765][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:46:01,094][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:46:01,421][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:46:01,749][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 09:46:02,459][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:46:03,185][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:46:03,186][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:46:03,188][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:46:04,157][__main__][INFO] - Iteration 279 took 23s (39.42% Gen, 56.42% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 44m 20s. Estimated total time: 19h 26m 35s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 53s, 500 more iterations: 3h 14m 25s. [2025-11-13 09:46:04,159][__main__][INFO] - Starting iteration 279. [2025-11-13 09:46:04,162][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 27 and human policies 1. 
[2025-11-13 09:46:04,163][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:46:13,396][__main__][INFO] - Number of regex retries in iteration 279: 0 [2025-11-13 09:46:13,397][__main__][INFO] - agents played in iteration 279 are Alice_buffer, Alice, Bob, Bob_buffer [2025-11-13 09:46:13,841][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:46:13,875][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:46:13,910][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:46:13,944][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:46:13,944][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:46:13,945][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:46:14,685][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:46:14,983][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:46:15,310][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:46:15,636][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:46:15,962][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:46:16,288][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:46:16,614][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:46:16,940][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:46:17,267][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:46:17,598][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:46:17,925][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:46:18,251][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:46:18,579][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:46:18,906][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:46:19,233][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:46:19,559][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:46:19,887][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:46:20,213][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:46:20,540][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:46:20,866][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:46:21,193][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128
[2025-11-13 09:46:21,519][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:46:21,847][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:46:22,175][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:46:22,502][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:46:22,828][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:46:23,154][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:46:23,480][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:46:23,807][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:46:24,135][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:46:24,462][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:46:24,788][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:46:25,120][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:46:25,836][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:46:26,556][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:46:26,557][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:46:26,559][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:46:27,532][__main__][INFO] - Iteration 280 took 23s (39.51% Gen, 56.32% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 45m 55s. Estimated total time: 19h 28m 33s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 57s, 500 more iterations: 3h 14m 45s.
[2025-11-13 09:46:27,534][__main__][INFO] - Starting iteration 280.
[2025-11-13 09:46:27,537][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 27 and human policies 1.
[2025-11-13 09:46:27,538][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:46:35,293][mllm.models.large_language_model_local][WARNING] - Response user Last round, the other agent played . did not match regex: (|), retry 1/1
[2025-11-13 09:46:36,824][__main__][INFO] - Number of regex retries in iteration 280: 1
[2025-11-13 09:46:36,824][__main__][INFO] - agents played in iteration 280 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:46:37,267][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:46:37,300][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:46:37,334][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:46:37,368][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:46:37,369][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:46:37,369][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:46:38,105][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:46:38,402][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:46:38,730][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:46:39,058][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:46:39,385][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:46:39,711][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:46:40,038][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:46:40,364][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:46:40,691][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:46:41,017][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:46:41,344][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:46:41,670][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:46:42,001][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:46:42,332][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:46:42,659][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:46:42,984][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:46:43,313][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:46:43,639][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:46:43,966][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:46:44,296][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:46:44,623][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:46:44,953][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:46:45,277][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:46:45,602][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:46:45,930][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:46:46,257][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:46:46,585][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:46:46,910][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:46:47,237][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:46:47,564][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:46:47,890][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:46:48,217][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:46:48,544][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:46:49,227][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:46:49,938][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:46:49,940][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:46:49,942][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:46:51,935][__main__][INFO] - Iteration 281 took 24s (38.06% Gen, 53.76% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 36m 51s. Estimated total time: 20h 19m 54s. Time estimates for 10 more iterations: 4m 3s, 100 more iterations: 40m 39s, 500 more iterations: 3h 23m 19s.
[2025-11-13 09:46:51,937][__main__][INFO] - Starting iteration 281.
[2025-11-13 09:46:51,940][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 28 and human policies 1.
[2025-11-13 09:46:51,941][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:47:01,514][__main__][INFO] - Number of regex retries in iteration 281: 0
[2025-11-13 09:47:01,515][__main__][INFO] - agents played in iteration 281 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:47:02,002][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:47:02,036][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:47:02,070][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:47:02,104][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:47:02,104][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:47:02,105][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:47:02,840][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:47:03,137][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:47:03,466][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:47:03,793][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:47:04,121][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:47:04,448][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:47:04,774][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:47:05,101][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:47:05,429][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:47:05,755][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:47:06,083][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:47:06,411][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:47:06,738][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:47:07,067][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:47:07,394][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:47:07,719][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:47:08,049][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:47:08,375][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:47:08,701][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:47:09,029][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:47:09,358][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:47:09,686][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:47:10,017][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:47:10,343][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:47:10,670][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:47:10,999][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:47:11,328][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:47:11,656][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:47:11,985][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:47:12,316][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:47:12,644][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:47:12,971][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:47:13,299][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:47:13,996][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:47:14,725][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:47:14,726][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:47:14,727][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:47:15,691][__main__][INFO] - Iteration 282 took 23s (40.31% Gen, 55.63% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 4m 7s. Estimated total time: 19h 47m 34s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 35s, 500 more iterations: 3h 17m 55s.
[2025-11-13 09:47:15,693][__main__][INFO] - Starting iteration 282.
[2025-11-13 09:47:15,696][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 28 and human policies 1.
[2025-11-13 09:47:15,697][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:47:24,967][__main__][INFO] - Number of regex retries in iteration 282: 0
[2025-11-13 09:47:24,968][__main__][INFO] - agents played in iteration 282 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:47:25,420][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:47:25,454][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:47:25,489][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:47:25,523][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:47:25,523][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:47:25,524][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:47:26,264][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:47:26,562][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:47:26,889][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:47:27,215][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:47:27,542][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:47:27,872][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:47:28,203][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:47:28,528][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:47:28,858][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:47:29,185][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:47:29,513][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:47:29,840][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:47:30,168][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:47:30,495][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:47:30,824][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:47:31,156][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:47:31,484][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:47:31,814][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:47:32,142][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:47:32,472][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:47:32,801][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:47:33,128][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:47:33,457][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:47:33,784][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:47:34,113][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:47:34,441][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:47:34,767][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:47:35,094][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:47:35,420][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:47:35,752][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:47:36,081][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:47:36,409][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:47:36,738][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:47:37,454][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:47:38,182][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:47:38,184][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:47:38,185][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:47:39,200][__main__][INFO] - Iteration 283 took 23s (39.44% Gen, 56.23% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 51m 25s. Estimated total time: 19h 35m 15s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 10s, 500 more iterations: 3h 15m 52s.
[2025-11-13 09:47:39,202][__main__][INFO] - Starting iteration 283.
[2025-11-13 09:47:39,206][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 28 and human policies 1.
[2025-11-13 09:47:39,207][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:47:48,480][__main__][INFO] - Number of regex retries in iteration 283: 0
[2025-11-13 09:47:48,481][__main__][INFO] - agents played in iteration 283 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:47:48,919][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:47:48,953][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:47:48,987][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:47:49,021][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:47:49,022][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:47:49,022][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:47:49,753][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:47:50,051][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:47:50,378][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:47:50,704][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:47:51,031][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:47:51,357][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:47:51,683][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:47:52,010][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:47:52,337][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:47:52,664][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:47:52,992][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:47:53,318][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:47:53,644][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:47:53,972][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:47:54,304][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:47:54,633][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:47:54,962][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:47:55,289][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:47:55,616][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:47:55,945][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:47:56,272][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:47:56,600][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:47:56,927][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:47:57,254][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:47:57,582][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:47:57,910][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:47:58,235][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:47:58,562][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:47:58,892][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:47:59,219][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:47:59,547][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:47:59,872][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:48:00,198][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:48:00,870][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:48:01,588][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:48:01,589][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:48:01,591][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:48:02,605][__main__][INFO] - Iteration 284 took 23s (39.63% Gen, 56.03% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 45m 46s. Estimated total time: 19h 29m 59s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 59s, 500 more iterations: 3h 14m 59s.
[2025-11-13 09:48:02,607][__main__][INFO] - Starting iteration 284.
[2025-11-13 09:48:02,610][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 28 and human policies 1.
[2025-11-13 09:48:02,611][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:48:12,047][__main__][INFO] - Number of regex retries in iteration 284: 0
[2025-11-13 09:48:12,048][__main__][INFO] - agents played in iteration 284 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:48:12,497][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:48:12,532][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:48:12,566][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:48:12,600][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:48:12,601][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:48:12,602][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:48:13,336][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:48:13,633][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:48:13,960][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:48:14,286][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:48:14,614][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:48:14,941][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:48:15,268][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:48:15,595][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:48:15,922][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:48:16,249][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:48:16,577][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:48:16,906][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:48:17,238][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:48:17,567][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:48:17,899][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:48:18,228][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:48:18,555][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:48:18,884][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:48:19,213][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:48:19,543][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:48:19,873][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:48:20,201][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:48:20,528][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:48:20,855][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:48:21,182][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:48:21,511][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:48:21,838][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:48:22,168][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:48:22,499][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:48:22,828][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:48:23,155][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:48:23,484][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:48:23,811][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:48:24,526][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:48:25,249][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:48:25,251][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:48:25,252][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:48:26,250][__main__][INFO] - Iteration 285 took 23s (39.92% Gen, 55.86% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 57m 25s. Estimated total time: 19h 42m 2s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 24s, 500 more iterations: 3h 17m 0s.
[2025-11-13 09:48:26,252][__main__][INFO] - Starting iteration 285.
[2025-11-13 09:48:26,256][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 28 and human policies 1.
[2025-11-13 09:48:26,256][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:48:35,764][__main__][INFO] - Number of regex retries in iteration 285: 0
[2025-11-13 09:48:35,765][__main__][INFO] - agents played in iteration 285 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:48:36,205][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:48:36,238][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:48:36,272][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:48:36,307][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:48:36,307][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:48:36,308][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:48:37,048][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:48:37,344][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:48:37,671][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:48:37,997][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:48:38,324][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:48:38,652][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:48:38,979][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:48:39,308][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:48:39,634][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:48:39,962][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:48:40,291][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:48:40,623][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:48:40,950][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:48:41,277][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:48:41,606][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:48:41,933][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:48:42,261][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:48:42,591][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:48:42,921][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:48:43,252][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:48:43,579][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:48:43,906][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:48:44,235][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:48:44,562][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:48:44,888][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:48:45,217][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:48:45,545][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:48:45,871][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:48:46,198][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:48:46,526][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:48:46,855][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:48:47,180][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:48:47,507][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:48:48,213][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:48:48,935][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:48:48,937][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:48:48,939][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:48:49,925][__main__][INFO] - Iteration 286 took 23s (40.17% Gen, 55.66% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 58m 30s. Estimated total time: 19h 43m 30s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 27s, 500 more iterations: 3h 17m 15s.
[2025-11-13 09:48:49,927][__main__][INFO] - Starting iteration 286.
[2025-11-13 09:48:49,931][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 28 and human policies 1.
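The per-iteration summary records above (e.g. "Iteration 286 took 23s (40.17% Gen, 55.66% Train). Generation: 9s, Training: 13s.") can be pulled out of the log with a small regex. The sketch below is illustrative only: the pattern is inferred from this log's format, and the function and field names (`parse_summary`, `total_s`, etc.) are our own, not part of the mllm codebase.

```python
import re
from typing import Optional

# Pattern inferred from the iteration-summary lines in this log.
SUMMARY_RE = re.compile(
    r"Iteration (?P<iteration>\d+) took (?P<total_s>\d+)s "
    r"\((?P<gen_pct>[\d.]+)% Gen, (?P<train_pct>[\d.]+)% Train\)\. "
    r"Generation: (?P<gen_s>\d+)s, Training: (?P<train_s>\d+)s\."
)

def parse_summary(line: str) -> Optional[dict]:
    """Extract timing fields from one iteration-summary log line, if present."""
    m = SUMMARY_RE.search(line)
    if m is None:
        return None
    d = m.groupdict()
    return {
        "iteration": int(d["iteration"]),
        "total_s": int(d["total_s"]),
        "gen_pct": float(d["gen_pct"]),
        "train_pct": float(d["train_pct"]),
        "gen_s": int(d["gen_s"]),
        "train_s": int(d["train_s"]),
    }

line = ("[2025-11-13 09:48:49,925][__main__][INFO] - Iteration 286 took 23s "
        "(40.17% Gen, 55.66% Train). Generation: 9s, Training: 13s.")
rec = parse_summary(line)
# rec["iteration"] == 286, rec["gen_s"] == 9, rec["train_s"] == 13
```

Applied over the whole file, this gives a per-iteration time series from which the Gen/Train split and the remaining-time estimates can be recomputed or plotted.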
[2025-11-13 09:48:49,931][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:48:59,896][__main__][INFO] - Number of regex retries in iteration 286: 0
[2025-11-13 09:48:59,896][__main__][INFO] - agents played in iteration 286 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:49:00,350][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:49:00,383][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:49:00,418][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:49:00,452][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:49:00,452][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:49:00,453][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:49:01,174][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:49:01,472][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:49:01,799][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:49:02,125][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:49:02,452][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:49:02,779][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:49:03,105][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:49:03,432][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:49:03,760][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:49:04,087][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:49:04,415][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:49:04,744][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:49:05,070][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:49:05,395][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:49:05,721][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:49:06,050][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:49:06,376][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:49:06,704][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:49:07,035][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:49:07,363][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:49:07,689][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:49:08,015][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:49:08,341][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:49:08,666][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:49:08,994][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:49:09,322][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:49:09,647][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:49:09,974][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:49:10,300][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:49:10,629][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:49:10,958][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:49:11,285][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:49:11,613][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:49:12,317][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:49:13,042][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:49:13,044][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:49:13,045][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:49:13,983][__main__][INFO] - Iteration 287 took 24s (41.43% Gen, 54.66% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 17m 17s. Estimated total time: 20h 2m 41s. Time estimates for 10 more iterations: 4m 0s, 100 more iterations: 40m 5s, 500 more iterations: 3h 20m 26s.
[2025-11-13 09:49:13,986][__main__][INFO] - Starting iteration 287.
[2025-11-13 09:49:14,058][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 28 and human policies 1.
[2025-11-13 09:49:14,058][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:49:23,043][__main__][INFO] - Number of regex retries in iteration 287: 0
[2025-11-13 09:49:23,044][__main__][INFO] - agents played in iteration 287 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:49:23,490][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:49:23,523][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:49:23,557][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:49:23,590][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:49:23,591][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:49:23,592][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:49:24,327][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:49:24,624][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:49:24,952][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:49:25,280][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:49:25,606][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:49:25,932][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:49:26,258][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:49:26,585][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:49:26,912][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:49:27,240][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:49:27,567][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:49:27,895][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:49:28,221][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:49:28,548][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:49:28,875][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:49:29,203][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:49:29,531][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:49:29,858][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:49:30,186][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:49:30,512][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:49:30,838][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:49:31,164][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:49:31,490][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:49:31,818][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:49:32,143][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:49:32,468][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:49:32,796][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:49:33,122][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:49:33,450][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:49:33,776][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:49:34,101][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:49:34,427][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:49:34,754][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:49:35,430][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:49:36,142][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:49:36,144][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:49:36,146][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:49:37,099][__main__][INFO] - Iteration 288 took 23s (38.88% Gen, 56.69% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 29m 47s. Estimated total time: 19h 15m 35s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 31s, 500 more iterations: 3h 12m 35s.
[2025-11-13 09:49:37,102][__main__][INFO] - Starting iteration 288.
[2025-11-13 09:49:37,105][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 28 and human policies 1.
[2025-11-13 09:49:37,106][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:49:46,866][__main__][INFO] - Number of regex retries in iteration 288: 0
[2025-11-13 09:49:46,866][__main__][INFO] - agents played in iteration 288 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:49:47,310][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:49:47,344][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:49:47,378][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:49:47,412][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:49:47,413][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:49:47,413][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:49:48,145][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:49:48,443][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:49:48,770][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:49:49,097][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:49:49,430][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:49:49,761][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:49:50,090][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:49:50,418][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:49:50,746][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:49:51,073][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:49:51,401][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:49:51,728][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:49:52,059][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:49:52,390][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:49:52,723][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:49:53,049][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:49:53,376][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:49:53,704][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:49:54,033][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:49:54,364][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:49:54,693][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:49:55,020][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:49:55,350][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:49:55,681][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:49:56,009][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:49:56,341][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:49:56,670][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:49:57,001][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:49:57,329][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:49:57,655][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:49:57,982][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:49:58,313][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:49:58,639][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:49:59,339][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:50:00,082][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:50:00,083][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:50:00,085][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:50:01,337][__main__][INFO] - Iteration 289 took 24s (40.28% Gen, 54.55% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 25m 26s. Estimated total time: 20h 11m 38s. Time estimates for 10 more iterations: 4m 2s, 100 more iterations: 40m 23s, 500 more iterations: 3h 21m 56s.
[2025-11-13 09:50:01,339][__main__][INFO] - Starting iteration 289.
[2025-11-13 09:50:01,342][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 28 and human policies 1.
[2025-11-13 09:50:01,343][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:50:10,949][__main__][INFO] - Number of regex retries in iteration 289: 0
[2025-11-13 09:50:10,950][__main__][INFO] - agents played in iteration 289 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:50:11,421][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:50:11,455][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:50:11,489][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:50:11,524][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:50:11,524][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:50:11,524][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:50:12,279][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:50:12,578][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:50:12,905][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:50:13,237][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:50:13,564][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:50:13,891][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:50:14,217][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:50:14,543][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:50:14,870][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:50:15,201][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:50:15,530][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:50:15,859][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:50:16,191][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:50:16,518][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:50:16,847][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:50:17,174][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:50:17,500][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:50:17,827][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:50:18,153][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:50:18,479][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:50:18,806][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:50:19,131][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:50:19,457][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:50:19,783][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:50:20,109][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:50:20,435][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:50:20,761][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:50:21,087][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:50:21,413][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:50:21,738][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:50:22,064][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:50:22,390][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:50:22,716][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:50:23,409][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:50:24,141][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:50:24,142][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:50:24,144][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:50:25,169][__main__][INFO] - Iteration 290 took 23s (40.32% Gen, 55.38% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 4m 46s. Estimated total time: 19h 51m 22s. Time estimates for 10 more iterations: 3m 58s, 100 more iterations: 39m 42s, 500 more iterations: 3h 18m 33s.
[2025-11-13 09:50:25,171][__main__][INFO] - Starting iteration 290.
[2025-11-13 09:50:25,174][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 28 and human policies 1.
[2025-11-13 09:50:25,174][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:50:34,431][__main__][INFO] - Number of regex retries in iteration 290: 0
[2025-11-13 09:50:34,432][__main__][INFO] - agents played in iteration 290 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:50:34,874][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:50:34,907][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:50:34,941][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:50:34,975][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:50:34,976][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:50:34,976][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:50:35,706][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:50:36,002][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:50:36,330][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:50:36,656][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:50:36,983][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:50:37,310][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:50:37,635][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:50:37,962][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:50:38,288][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:50:38,616][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:50:38,944][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:50:39,271][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:50:39,600][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:50:39,928][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:50:40,256][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:50:40,583][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:50:40,910][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:50:41,241][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:50:41,569][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:50:41,894][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:50:42,223][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:50:42,550][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:50:42,877][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:50:43,202][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:50:43,527][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:50:43,854][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:50:44,182][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:50:44,510][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:50:44,838][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:50:45,165][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:50:45,490][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:50:45,817][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:50:46,144][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:50:46,828][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:50:47,557][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:50:47,559][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:50:47,560][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:50:49,982][__main__][INFO] - Iteration 291 took 24s (37.32% Gen, 52.92% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 53m 26s. Estimated total time: 20h 40m 27s. Time estimates for 10 more iterations: 4m 8s, 100 more iterations: 41m 20s, 500 more iterations: 3h 26m 44s.
[2025-11-13 09:50:49,984][__main__][INFO] - Starting iteration 291.
[2025-11-13 09:50:49,988][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 29 and human policies 1.
[2025-11-13 09:50:49,988][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:50:59,123][__main__][INFO] - Number of regex retries in iteration 291: 0 [2025-11-13 09:50:59,124][__main__][INFO] - agents played in iteration 291 are Alice_buffer, Alice, Bob, Bob_buffer [2025-11-13 09:50:59,565][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:50:59,599][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:50:59,634][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:51:00,010][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:51:00,011][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:51:00,011][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:51:00,755][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:51:01,051][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:51:01,378][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:51:01,704][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:51:02,031][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:51:02,358][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:51:02,686][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:51:03,014][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:51:03,342][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:51:03,668][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:51:04,001][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:51:04,329][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:51:04,659][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:51:04,986][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:51:05,317][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:51:05,643][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:51:05,971][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:51:06,297][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:51:06,622][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:51:06,950][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:51:07,278][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:51:07,606][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:51:07,935][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:51:08,260][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:51:08,586][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:51:08,913][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:51:09,242][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:51:09,570][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:51:09,896][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:51:10,221][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:51:10,548][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:51:10,873][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:51:11,205][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:51:11,914][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:51:12,653][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:51:12,654][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:51:12,656][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:51:13,636][__main__][INFO] - Iteration 292 took 23s (38.63% Gen, 57.22% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 55m 3s. Estimated total time: 19h 42m 28s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 24s, 500 more iterations: 3h 17m 4s.
[2025-11-13 09:51:13,638][__main__][INFO] - Starting iteration 292.
[2025-11-13 09:51:13,641][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 29 and human policies 1.
[2025-11-13 09:51:13,642][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:51:23,632][__main__][INFO] - Number of regex retries in iteration 292: 0
[2025-11-13 09:51:23,632][__main__][INFO] - agents played in iteration 292 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:51:24,079][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:51:24,113][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:51:24,147][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:51:24,180][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:51:24,181][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:51:24,181][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:51:24,888][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:51:25,184][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:51:25,513][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:51:25,841][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:51:26,170][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:51:26,498][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:51:26,824][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:51:27,151][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:51:27,478][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:51:27,810][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:51:28,139][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:51:28,468][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:51:28,796][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:51:29,127][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:51:29,459][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:51:29,787][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:51:30,112][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:51:30,439][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:51:30,765][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:51:31,092][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:51:31,419][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:51:31,743][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:51:32,071][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:51:32,399][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:51:32,725][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:51:33,053][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:51:33,379][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:51:33,706][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:51:34,032][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:51:34,363][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:51:34,691][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:51:35,016][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:51:35,346][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:51:36,178][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:51:36,906][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:51:36,907][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:51:36,909][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:51:37,794][__main__][INFO] - Iteration 293 took 24s (41.36% Gen, 54.97% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 19m 51s. Estimated total time: 20h 7m 40s. Time estimates for 10 more iterations: 4m 1s, 100 more iterations: 40m 15s, 500 more iterations: 3h 21m 16s.
[2025-11-13 09:51:37,796][__main__][INFO] - Starting iteration 293.
[2025-11-13 09:51:37,798][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 29 and human policies 1.
[2025-11-13 09:51:37,799][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:51:46,808][__main__][INFO] - Number of regex retries in iteration 293: 0
[2025-11-13 09:51:46,808][__main__][INFO] - agents played in iteration 293 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:51:47,260][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:51:47,294][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:51:47,328][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:51:47,363][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:51:47,364][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:51:47,365][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:51:48,440][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:51:48,736][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:51:49,064][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:51:49,392][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:51:49,719][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:51:50,045][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:51:50,371][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:51:50,697][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:51:51,024][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:51:51,351][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:51:51,683][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:51:52,011][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:51:52,342][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:51:52,675][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:51:53,004][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:51:53,334][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:51:53,663][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:51:53,991][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:51:54,321][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:51:54,648][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:51:54,974][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:51:55,300][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:51:55,626][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:51:55,953][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:51:56,277][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:51:56,603][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:51:56,931][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:51:57,257][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:51:57,584][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:51:57,911][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:51:58,238][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:51:58,565][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:51:58,894][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:51:59,616][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:52:00,359][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:52:00,361][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:52:00,362][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:52:01,265][__main__][INFO] - Iteration 294 took 23s (38.39% Gen, 57.76% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 45m 10s. Estimated total time: 19h 33m 22s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 6s, 500 more iterations: 3h 15m 33s.
[2025-11-13 09:52:01,267][__main__][INFO] - Starting iteration 294.
[2025-11-13 09:52:01,270][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 29 and human policies 1.
[2025-11-13 09:52:01,270][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:52:10,581][__main__][INFO] - Number of regex retries in iteration 294: 0
[2025-11-13 09:52:10,582][__main__][INFO] - agents played in iteration 294 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:52:11,050][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:52:11,083][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:52:11,117][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:52:11,150][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:52:11,151][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:52:11,151][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:52:11,873][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:52:12,170][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:52:12,496][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:52:12,823][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:52:13,149][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:52:13,476][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:52:13,801][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:52:14,128][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:52:14,455][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:52:14,782][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:52:15,108][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:52:15,434][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:52:15,761][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:52:16,091][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:52:16,416][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:52:16,743][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:52:17,071][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:52:17,398][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:52:17,723][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:52:18,049][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:52:18,375][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:52:18,701][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:52:19,028][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:52:19,354][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:52:19,680][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:52:20,006][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:52:20,331][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:52:20,657][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:52:20,983][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:52:21,309][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:52:21,635][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:52:21,962][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:52:22,290][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:52:22,992][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:52:23,714][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:52:23,715][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:52:23,717][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:52:24,650][__main__][INFO] - Iteration 295 took 23s (39.82% Gen, 56.18% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 40m 29s. Estimated total time: 19h 29m 4s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 58s, 500 more iterations: 3h 14m 50s.
[2025-11-13 09:52:24,652][__main__][INFO] - Starting iteration 295.
[2025-11-13 09:52:24,655][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 29 and human policies 1.
[2025-11-13 09:52:24,656][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:52:33,637][__main__][INFO] - Number of regex retries in iteration 295: 0
[2025-11-13 09:52:33,638][__main__][INFO] - agents played in iteration 295 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:52:34,082][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:52:34,117][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:52:34,151][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:52:34,185][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:52:34,186][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:52:34,186][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:52:34,923][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:52:35,219][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:52:35,546][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:52:35,873][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:52:36,201][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:52:36,528][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:52:36,856][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:52:37,184][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:52:37,509][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:52:37,837][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:52:38,163][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:52:38,489][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:52:38,815][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:52:39,144][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:52:39,471][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:52:39,798][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:52:40,124][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:52:40,452][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:52:40,779][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:52:41,105][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:52:41,431][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:52:41,759][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:52:42,086][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:52:42,411][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:52:42,737][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:52:43,063][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:52:43,389][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:52:43,717][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:52:44,045][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:52:44,370][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:52:44,698][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:52:45,029][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:52:45,356][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:52:46,046][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:52:46,770][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:52:46,771][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:52:46,773][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:52:47,858][__main__][INFO] - Iteration 296 took 23s (38.71% Gen, 56.61% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 31m 13s. Estimated total time: 19h 20m 12s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 40s, 500 more iterations: 3h 13m 22s.
[2025-11-13 09:52:47,860][__main__][INFO] - Starting iteration 296.
[2025-11-13 09:52:47,864][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 29 and human policies 1.
[2025-11-13 09:52:47,864][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:52:57,335][__main__][INFO] - Number of regex retries in iteration 296: 0
[2025-11-13 09:52:57,336][__main__][INFO] - agents played in iteration 296 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:52:57,786][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:52:57,819][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:52:57,853][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:52:57,888][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:52:57,888][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:52:57,888][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:52:58,622][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:52:58,920][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:52:59,247][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:52:59,575][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:52:59,903][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:53:00,230][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:53:00,557][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:53:00,886][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:53:01,217][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:53:01,543][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:53:01,871][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:53:02,199][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:53:02,526][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:53:02,853][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:53:03,180][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:53:03,511][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:53:03,840][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:53:04,166][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:53:04,491][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:53:04,817][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:53:05,147][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:53:05,473][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:53:05,798][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:53:06,128][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:53:06,455][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:53:06,783][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:53:07,108][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:53:07,434][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:53:07,761][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:53:08,089][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:53:08,416][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:53:08,744][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:53:09,072][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:53:09,788][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:53:10,525][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:53:10,526][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:53:10,528][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:53:11,760][__main__][INFO] - Iteration 297 took 23s (39.64% Gen, 55.20% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 5m 30s. Estimated total time: 19h 54m 52s. Time estimates for 10 more iterations: 3m 58s, 100 more iterations: 39m 49s, 500 more iterations: 3h 19m 8s. [2025-11-13 09:53:11,762][__main__][INFO] - Starting iteration 297. [2025-11-13 09:53:11,765][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 29 and human policies 1. 
[2025-11-13 09:53:11,766][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:53:21,351][__main__][INFO] - Number of regex retries in iteration 297: 0 [2025-11-13 09:53:21,352][__main__][INFO] - agents played in iteration 297 are Alice_buffer, Alice, Bob, Bob_buffer [2025-11-13 09:53:21,798][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:53:21,832][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:53:21,866][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:53:21,901][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:53:21,901][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:53:21,901][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:53:22,621][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:53:22,916][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:53:23,244][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:53:23,572][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:53:23,898][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:53:24,226][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:53:24,552][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:53:24,880][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:53:25,205][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:53:25,530][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:53:25,857][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:53:26,184][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:53:26,510][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:53:26,839][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:53:27,167][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:53:27,496][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:53:27,824][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:53:28,150][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:53:28,476][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:53:28,801][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:53:29,128][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 09:53:29,455][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:53:29,782][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:53:30,107][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:53:30,434][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:53:30,760][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:53:31,088][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:53:31,414][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:53:31,741][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:53:32,067][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:53:32,398][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:53:32,725][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:53:33,051][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 09:53:33,747][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:53:34,495][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:53:34,496][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:53:34,498][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:53:35,446][__main__][INFO] - Iteration 298 took 23s (40.48% Gen, 55.51% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 54m 19s. Estimated total time: 19h 44m 5s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 28s, 500 more iterations: 3h 17m 20s. [2025-11-13 09:53:35,448][__main__][INFO] - Starting iteration 298. [2025-11-13 09:53:35,451][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 29 and human policies 1. 
[2025-11-13 09:53:35,452][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:53:45,400][__main__][INFO] - Number of regex retries in iteration 298: 0 [2025-11-13 09:53:45,401][__main__][INFO] - agents played in iteration 298 are Alice_buffer, Alice, Bob, Bob_buffer [2025-11-13 09:53:45,870][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:53:45,904][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:53:45,937][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:53:45,972][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:53:45,973][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:53:45,973][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:53:46,690][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:53:46,987][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:53:47,314][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:53:47,640][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:53:47,966][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:53:48,293][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:53:48,619][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:53:48,950][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:53:49,280][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:53:49,606][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:53:49,933][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:53:50,260][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:53:50,588][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:53:50,919][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:53:51,248][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:53:51,577][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:53:51,904][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:53:52,231][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:53:52,556][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:53:52,881][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:53:53,207][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 09:53:53,532][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:53:53,859][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:53:54,183][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:53:54,509][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:53:54,836][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:53:55,161][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:53:55,488][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:53:55,815][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:53:56,146][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:53:56,471][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:53:56,798][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:53:57,126][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 09:53:57,827][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:53:58,551][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:53:58,553][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:53:58,554][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:53:59,487][__main__][INFO] - Iteration 299 took 24s (41.39% Gen, 54.72% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 11m 40s. Estimated total time: 20h 1m 51s. Time estimates for 10 more iterations: 4m 0s, 100 more iterations: 40m 3s, 500 more iterations: 3h 20m 18s. [2025-11-13 09:53:59,489][__main__][INFO] - Starting iteration 299. [2025-11-13 09:53:59,492][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 29 and human policies 1. 
[2025-11-13 09:53:59,492][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:54:09,496][__main__][INFO] - Number of regex retries in iteration 299: 0 [2025-11-13 09:54:09,497][__main__][INFO] - agents played in iteration 299 are Alice_buffer, Alice, Bob, Bob_buffer [2025-11-13 09:54:09,945][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:54:09,978][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:54:10,011][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:54:10,044][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:54:10,045][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:54:10,045][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:54:10,765][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:54:11,063][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:54:11,391][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:54:11,720][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:54:12,051][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:54:12,380][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:54:12,708][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:54:13,041][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:54:13,367][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:54:13,694][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:54:14,023][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:54:14,351][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:54:14,678][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:54:15,006][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:54:15,332][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:54:15,662][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:54:15,991][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:54:16,316][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:54:16,647][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:54:16,976][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:54:17,305][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 09:54:17,632][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:54:17,960][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:54:18,287][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:54:18,616][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:54:18,944][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:54:19,276][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:54:19,603][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:54:19,931][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:54:20,258][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:54:20,585][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:54:20,913][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:54:21,239][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 09:54:21,970][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:54:22,709][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:54:22,710][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:54:22,712][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:54:23,698][__main__][INFO] - Iteration 300 took 24s (41.33% Gen, 54.59% Train). Generation: 10s, Training: 13s. Estimated remaining time: 18h 19m 46s. Estimated total time: 20h 10m 21s. Time estimates for 10 more iterations: 4m 2s, 100 more iterations: 40m 20s, 500 more iterations: 3h 21m 43s. [2025-11-13 09:54:23,700][__main__][INFO] - Starting iteration 300. [2025-11-13 09:54:23,703][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 29 and human policies 1. 
[2025-11-13 09:54:23,704][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:54:33,743][__main__][INFO] - Number of regex retries in iteration 300: 0 [2025-11-13 09:54:33,744][__main__][INFO] - agents played in iteration 300 are Alice_buffer, Alice, Bob, Bob_buffer [2025-11-13 09:54:34,199][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:54:34,234][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:54:34,269][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:54:34,303][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:54:34,304][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:54:34,304][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:54:35,021][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:54:35,316][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:54:35,642][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:54:35,971][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:54:36,299][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:54:36,625][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:54:36,952][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:54:37,278][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:54:37,605][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:54:37,932][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:54:38,259][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:54:38,585][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:54:38,912][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:54:39,240][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:54:39,566][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:54:39,890][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:54:40,216][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:54:40,542][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:54:40,867][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:54:41,193][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:54:41,518][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 09:54:41,844][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:54:42,171][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:54:42,496][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:54:42,824][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:54:43,150][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:54:43,476][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:54:43,803][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:54:44,129][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:54:44,455][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:54:44,781][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:54:45,107][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:54:45,434][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 09:54:46,134][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:54:46,858][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:54:46,860][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:54:46,862][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:54:48,662][__main__][INFO] - Iteration 301 took 24s (40.23% Gen, 52.56% Train). Generation: 10s, Training: 13s. Estimated remaining time: 18h 56m 59s. Estimated total time: 20h 47m 59s. Time estimates for 10 more iterations: 4m 9s, 100 more iterations: 41m 35s, 500 more iterations: 3h 27m 59s. [2025-11-13 09:54:48,668][__main__][INFO] - Starting iteration 301. [2025-11-13 09:54:48,671][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 30 and human policies 1. 
[2025-11-13 09:54:48,671][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:54:58,602][__main__][INFO] - Number of regex retries in iteration 301: 0 [2025-11-13 09:54:58,603][__main__][INFO] - agents played in iteration 301 are Alice_buffer, Alice, Bob, Bob_buffer [2025-11-13 09:54:59,047][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:54:59,081][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:54:59,114][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:54:59,147][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:54:59,148][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:54:59,148][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:54:59,851][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:55:00,148][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:55:00,474][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:55:00,800][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:55:01,127][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:55:01,458][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:55:01,786][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:55:02,113][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:55:02,440][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:55:02,766][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:55:03,094][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:55:03,422][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:55:03,749][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:55:04,076][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:55:04,402][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:55:04,729][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:55:05,056][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:55:05,384][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:55:05,710][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:55:06,036][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:55:06,362][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 09:55:06,690][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:55:07,021][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:55:07,348][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:55:07,673][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:55:07,999][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:55:08,324][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:55:08,650][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:55:08,978][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:55:09,305][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:55:09,632][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:55:09,961][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:55:10,287][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 09:55:11,002][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:55:11,702][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:55:11,703][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:55:11,704][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:55:12,639][__main__][INFO] - Iteration 302 took 23s (41.43% Gen, 54.66% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 7m 4s. Estimated total time: 19h 58m 28s. Time estimates for 10 more iterations: 3m 59s, 100 more iterations: 39m 56s, 500 more iterations: 3h 19m 44s.
[2025-11-13 09:55:12,642][__main__][INFO] - Starting iteration 302.
[2025-11-13 09:55:12,644][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 30 and human policies 1.
[2025-11-13 09:55:12,645][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:55:21,545][__main__][INFO] - Number of regex retries in iteration 302: 0
[2025-11-13 09:55:21,546][__main__][INFO] - agents played in iteration 302 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:55:22,007][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:55:22,041][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:55:22,074][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:55:22,107][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:55:22,107][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:55:22,108][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:55:22,817][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:55:23,112][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:55:23,440][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:55:23,769][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:55:24,095][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:55:24,421][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:55:24,747][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:55:25,074][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:55:25,402][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:55:25,730][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:55:26,058][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:55:26,386][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:55:26,712][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:55:27,042][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:55:27,369][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:55:27,701][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:55:28,028][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:55:28,354][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:55:28,679][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:55:29,006][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:55:29,333][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:55:29,660][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:55:29,988][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:55:30,319][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:55:30,647][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:55:30,972][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:55:31,297][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:55:31,624][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:55:31,951][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:55:32,277][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:55:32,603][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:55:32,931][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:55:33,260][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:55:33,974][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:55:34,665][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:55:34,666][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:55:34,668][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:55:35,563][__main__][INFO] - Iteration 303 took 22s (38.84% Gen, 57.25% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 14m 12s. Estimated total time: 19h 5m 58s. Time estimates for 10 more iterations: 3m 49s, 100 more iterations: 38m 11s, 500 more iterations: 3h 10m 59s.
[2025-11-13 09:55:35,565][__main__][INFO] - Starting iteration 303.
[2025-11-13 09:55:35,568][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 30 and human policies 1.
[2025-11-13 09:55:35,569][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:55:45,146][__main__][INFO] - Number of regex retries in iteration 303: 0
[2025-11-13 09:55:45,146][__main__][INFO] - agents played in iteration 303 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:55:45,591][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:55:45,630][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:55:45,664][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:55:45,698][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:55:45,698][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:55:45,699][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:55:46,404][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:55:46,700][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:55:47,026][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:55:47,354][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:55:47,681][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:55:48,007][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:55:48,338][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:55:48,670][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:55:48,999][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:55:49,327][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:55:49,653][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:55:49,982][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:55:50,311][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:55:50,638][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:55:50,965][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:55:51,292][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:55:51,617][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:55:51,944][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:55:52,272][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:55:52,597][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:55:52,924][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:55:53,250][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:55:53,577][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:55:53,904][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:55:54,233][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:55:54,560][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:55:54,885][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:55:55,215][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:55:55,545][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:55:55,876][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:55:56,204][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:55:56,535][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:55:56,861][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:55:57,579][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:55:58,272][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:55:58,274][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:55:58,275][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:55:59,288][__main__][INFO] - Iteration 304 took 23s (40.37% Gen, 55.35% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 53m 55s. Estimated total time: 19h 46m 5s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 32s, 500 more iterations: 3h 17m 40s.
[2025-11-13 09:55:59,291][__main__][INFO] - Starting iteration 304.
[2025-11-13 09:55:59,295][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 30 and human policies 1.
[2025-11-13 09:55:59,295][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:56:09,194][__main__][INFO] - Number of regex retries in iteration 304: 0
[2025-11-13 09:56:09,194][__main__][INFO] - agents played in iteration 304 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:56:09,639][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:56:09,673][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:56:09,706][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:56:09,739][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:56:09,739][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:56:09,740][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:56:10,441][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:56:10,738][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:56:11,064][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:56:11,390][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:56:11,716][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:56:12,043][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:56:12,371][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:56:12,698][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:56:13,030][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:56:13,358][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:56:13,687][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:56:14,012][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:56:14,340][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:56:14,667][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:56:14,991][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:56:15,319][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:56:15,644][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:56:15,972][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:56:16,297][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:56:16,624][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:56:16,950][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:56:17,276][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:56:17,604][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:56:17,930][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:56:18,256][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:56:18,584][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:56:18,911][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:56:19,237][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:56:19,564][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:56:19,890][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:56:20,218][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:56:20,545][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:56:20,872][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:56:21,582][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:56:22,281][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:56:22,283][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:56:22,285][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:56:23,259][__main__][INFO] - Iteration 305 took 23s (41.30% Gen, 54.62% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 5m 42s. Estimated total time: 19h 58m 16s. Time estimates for 10 more iterations: 3m 59s, 100 more iterations: 39m 56s, 500 more iterations: 3h 19m 42s.
[2025-11-13 09:56:23,261][__main__][INFO] - Starting iteration 305.
[2025-11-13 09:56:23,264][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 30 and human policies 1.
[2025-11-13 09:56:23,265][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:56:32,701][__main__][INFO] - Number of regex retries in iteration 305: 0
[2025-11-13 09:56:32,702][__main__][INFO] - agents played in iteration 305 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:56:33,144][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:56:33,178][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:56:33,211][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:56:33,244][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:56:33,244][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:56:33,245][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:56:33,945][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:56:34,241][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:56:34,567][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:56:34,894][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:56:35,221][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:56:35,548][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:56:35,874][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:56:36,200][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:56:36,528][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:56:36,855][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:56:37,182][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:56:37,508][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:56:37,834][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:56:38,161][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:56:38,488][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:56:38,815][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:56:39,140][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:56:39,466][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:56:39,791][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:56:40,116][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:56:40,442][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:56:40,769][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:56:41,101][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:56:41,431][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:56:41,759][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:56:42,084][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:56:42,413][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:56:42,739][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:56:43,066][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:56:43,395][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:56:43,721][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:56:44,049][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:56:44,380][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:56:45,095][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:56:45,789][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:56:45,791][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:56:45,792][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:56:46,897][__main__][INFO] - Iteration 306 took 23s (39.92% Gen, 55.39% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 48m 43s. Estimated total time: 19h 41m 41s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 23s, 500 more iterations: 3h 16m 56s.
[2025-11-13 09:56:46,899][__main__][INFO] - Starting iteration 306.
[2025-11-13 09:56:46,902][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 30 and human policies 1.
[2025-11-13 09:56:46,903][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:56:55,593][__main__][INFO] - Number of regex retries in iteration 306: 0
[2025-11-13 09:56:55,594][__main__][INFO] - agents played in iteration 306 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:56:56,039][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:56:56,072][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:56:56,106][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:56:56,140][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:56:56,140][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:56:56,141][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:56:56,870][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:56:57,167][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:56:57,494][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:56:57,821][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:56:58,147][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:56:58,473][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:56:58,799][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:56:59,125][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:56:59,451][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:56:59,778][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:57:00,104][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:57:00,431][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:57:00,763][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:57:01,094][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:57:01,421][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:57:01,749][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:57:02,074][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:57:02,401][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:57:02,728][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:57:03,055][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:57:03,381][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:57:03,709][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:57:04,037][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:57:04,363][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:57:04,691][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:57:05,018][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:57:05,347][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:57:05,672][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:57:06,000][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:57:06,327][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:57:06,653][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:57:06,979][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:57:07,306][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:57:08,029][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:57:08,724][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:57:08,729][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:57:08,731][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:57:09,636][__main__][INFO] - Iteration 307 took 22s (38.22% Gen, 57.78% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 3m 23s. Estimated total time: 18h 56m 44s. Time estimates for 10 more iterations: 3m 47s, 100 more iterations: 37m 53s, 500 more iterations: 3h 9m 27s.
[2025-11-13 09:57:09,638][__main__][INFO] - Starting iteration 307.
[2025-11-13 09:57:09,640][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 30 and human policies 1.
[2025-11-13 09:57:09,641][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:57:19,163][__main__][INFO] - Number of regex retries in iteration 307: 0
[2025-11-13 09:57:19,163][__main__][INFO] - agents played in iteration 307 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:57:19,606][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:57:19,640][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:57:19,673][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:57:19,706][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:57:19,707][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:57:19,708][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:57:20,427][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:57:20,722][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:57:21,049][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:57:21,376][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:57:21,703][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:57:22,029][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:57:22,357][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:57:22,683][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:57:23,010][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:57:23,339][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:57:23,671][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:57:23,996][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:57:24,325][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:57:24,653][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:57:24,980][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:57:25,309][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:57:25,638][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:57:25,964][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:57:26,290][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:57:26,619][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:57:26,948][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:57:27,274][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:57:27,602][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:57:27,927][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:57:28,252][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:57:28,579][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:57:28,905][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:57:29,234][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:57:29,561][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:57:29,887][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:57:30,214][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:57:30,540][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:57:30,866][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:57:31,572][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:57:32,265][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:57:32,267][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:57:32,269][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:57:33,393][__main__][INFO] - Iteration 308 took 23s (40.09% Gen, 55.18% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 53m 56s. Estimated total time: 19h 47m 40s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 35s, 500 more iterations: 3h 17m 56s. [2025-11-13 09:57:33,395][__main__][INFO] - Starting iteration 308. [2025-11-13 09:57:33,398][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 30 and human policies 1. 
[2025-11-13 09:57:33,399][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:57:42,631][__main__][INFO] - Number of regex retries in iteration 308: 0
[2025-11-13 09:57:42,632][__main__][INFO] - agents played in iteration 308 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:57:43,070][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:57:43,104][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:57:43,138][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:57:43,172][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:57:43,172][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:57:43,173][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:57:43,903][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:57:44,200][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:57:44,528][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:57:44,859][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:57:45,189][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:57:45,516][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:57:45,844][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:57:46,171][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:57:46,497][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:57:46,825][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:57:47,152][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:57:47,479][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:57:47,806][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:57:48,133][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:57:48,460][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:57:48,787][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:57:49,114][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:57:49,442][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:57:49,771][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:57:50,102][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:57:50,429][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:57:50,755][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:57:51,082][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:57:51,409][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:57:51,738][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:57:52,067][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:57:52,393][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:57:52,720][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:57:53,047][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:57:53,375][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:57:53,702][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:57:54,030][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:57:54,357][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:57:55,068][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:57:55,760][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:57:55,762][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:57:55,763][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:57:56,671][__main__][INFO] - Iteration 309 took 23s (39.67% Gen, 56.42% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 29m 32s. Estimated total time: 19h 23m 39s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 47s, 500 more iterations: 3h 13m 56s.
[2025-11-13 09:57:56,673][__main__][INFO] - Starting iteration 309.
[2025-11-13 09:57:56,676][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 30 and human policies 1.
[2025-11-13 09:57:56,677][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:58:05,404][__main__][INFO] - Number of regex retries in iteration 309: 0
[2025-11-13 09:58:05,405][__main__][INFO] - agents played in iteration 309 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:58:05,845][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:58:05,878][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:58:05,912][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:58:05,945][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:58:05,946][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:58:05,946][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:58:06,672][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:58:06,969][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:58:07,301][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:58:07,628][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:58:07,955][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:58:08,284][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:58:08,616][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:58:08,947][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:58:09,272][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:58:09,601][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:58:09,928][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:58:10,255][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:58:10,585][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:58:10,919][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:58:11,247][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:58:11,573][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:58:11,901][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:58:12,228][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:58:12,555][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:58:12,883][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:58:13,214][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:58:13,544][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:58:13,871][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:58:14,196][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:58:14,524][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:58:14,851][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:58:15,177][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:58:15,504][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:58:15,831][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:58:16,157][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:58:16,484][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:58:16,815][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:58:17,142][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:58:17,853][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:58:18,542][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:58:18,544][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:58:18,545][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:58:19,658][__main__][INFO] - Iteration 310 took 22s (37.98% Gen, 57.18% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 14m 36s. Estimated total time: 19h 9m 7s. Time estimates for 10 more iterations: 3m 49s, 100 more iterations: 38m 18s, 500 more iterations: 3h 11m 31s.
[2025-11-13 09:58:19,660][__main__][INFO] - Starting iteration 310.
[2025-11-13 09:58:19,662][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 30 and human policies 1.
[2025-11-13 09:58:19,663][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:58:28,701][__main__][INFO] - Number of regex retries in iteration 310: 0
[2025-11-13 09:58:28,702][__main__][INFO] - agents played in iteration 310 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:58:29,140][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:58:29,173][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:58:29,208][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:58:29,242][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:58:29,243][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:58:29,243][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:58:29,974][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:58:30,271][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:58:30,599][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:58:30,926][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:58:31,253][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:58:31,580][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:58:31,907][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:58:32,233][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:58:32,559][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:58:32,887][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:58:33,213][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:58:33,539][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:58:33,868][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:58:34,197][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:58:34,527][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:58:34,856][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:58:35,189][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:58:35,517][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:58:35,849][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:58:36,183][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:58:36,511][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:58:36,838][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:58:37,165][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:58:37,491][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:58:37,818][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:58:38,151][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:58:38,481][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:58:38,810][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:58:39,136][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:58:39,466][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:58:39,794][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:58:40,123][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:58:40,451][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:58:41,169][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:58:41,858][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:58:41,860][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:58:41,862][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:58:43,923][__main__][INFO] - Iteration 311 took 24s (37.25% Gen, 54.24% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 18m 11s. Estimated total time: 20h 13m 5s. Time estimates for 10 more iterations: 4m 2s, 100 more iterations: 40m 26s, 500 more iterations: 3h 22m 10s.
[2025-11-13 09:58:43,939][__main__][INFO] - Starting iteration 311.
[2025-11-13 09:58:43,942][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 31 and human policies 1.
[2025-11-13 09:58:43,942][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:58:52,779][__main__][INFO] - Number of regex retries in iteration 311: 0
[2025-11-13 09:58:52,780][__main__][INFO] - agents played in iteration 311 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:58:53,223][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:58:53,258][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:58:53,292][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:58:53,326][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:58:53,327][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:58:53,327][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:58:54,048][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:58:54,345][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:58:54,673][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:58:55,001][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:58:55,329][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:58:55,657][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:58:55,984][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:58:56,310][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:58:56,636][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:58:56,965][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:58:57,294][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:58:57,622][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:58:57,949][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:58:58,277][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:58:58,603][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:58:58,932][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:58:59,262][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:58:59,591][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:58:59,916][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:59:00,245][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:59:00,572][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:59:00,903][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:59:01,235][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:59:01,563][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:59:01,891][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:59:02,218][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:59:02,544][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:59:02,870][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:59:03,198][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:59:03,524][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:59:03,850][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:59:04,177][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:59:04,505][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:59:05,234][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:59:05,926][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:59:05,927][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:59:05,929][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:59:06,845][__main__][INFO] - Iteration 312 took 22s (38.58% Gen, 57.41% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 9m 53s. Estimated total time: 19h 5m 11s. Time estimates for 10 more iterations: 3m 49s, 100 more iterations: 38m 10s, 500 more iterations: 3h 10m 51s.
[2025-11-13 09:59:06,847][__main__][INFO] - Starting iteration 312.
[2025-11-13 09:59:06,850][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 31 and human policies 1.
[2025-11-13 09:59:06,851][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:59:15,626][__main__][INFO] - Number of regex retries in iteration 312: 0
[2025-11-13 09:59:15,627][__main__][INFO] - agents played in iteration 312 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:59:16,066][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:59:16,100][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:59:16,133][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:59:16,167][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:59:16,167][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:59:16,168][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:59:16,890][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:59:17,188][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:59:17,517][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:59:17,843][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:59:18,171][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:59:18,497][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:59:18,824][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:59:19,151][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:59:19,478][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:59:19,805][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:59:20,132][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:59:20,458][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:59:20,788][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:59:21,117][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:59:21,449][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:59:21,778][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:59:22,112][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:59:22,442][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:59:22,773][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:59:23,098][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:59:23,425][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:59:23,752][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:59:24,079][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:59:24,409][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:59:24,738][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:59:25,069][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:59:25,398][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:59:25,730][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:59:26,059][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:59:26,387][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:59:26,715][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:59:27,042][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:59:27,370][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:59:28,070][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:59:28,769][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:59:28,771][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:59:28,773][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:59:29,673][__main__][INFO] - Iteration 313 took 22s (38.45% Gen, 57.60% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 5m 31s. Estimated total time: 19h 1m 12s. Time estimates for 10 more iterations: 3m 48s, 100 more iterations: 38m 2s, 500 more iterations: 3h 10m 12s.
[2025-11-13 09:59:29,675][__main__][INFO] - Starting iteration 313.
[2025-11-13 09:59:29,679][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 31 and human policies 1.
[2025-11-13 09:59:29,679][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:59:39,112][__main__][INFO] - Number of regex retries in iteration 313: 0
[2025-11-13 09:59:39,112][__main__][INFO] - agents played in iteration 313 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 09:59:39,558][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:59:39,592][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:59:39,626][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:59:39,659][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:59:39,660][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:59:39,661][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:59:40,385][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:59:40,682][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:59:41,009][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:59:41,338][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:59:41,665][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:59:41,992][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:59:42,318][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:59:42,645][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:59:42,972][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:59:43,299][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:59:43,626][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:59:43,954][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:59:44,281][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:59:44,609][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:59:44,936][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:59:45,263][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:59:45,590][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:59:45,918][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:59:46,243][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:59:46,569][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:59:46,896][mllm.training.trainer_common][INFO] - Processing
mini-batch 80 of 128
[2025-11-13 09:59:47,223][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:59:47,549][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:59:47,875][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:59:48,204][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:59:48,531][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:59:48,860][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:59:49,192][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:59:49,521][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:59:49,851][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:59:50,177][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:59:50,504][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:59:50,830][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:59:51,533][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:59:52,221][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:59:52,222][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:59:52,224][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:59:53,123][__main__][INFO] - Iteration 314 took 23s (40.23% Gen, 55.92% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 36m 13s. Estimated total time: 19h 32m 17s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 4s, 500 more iterations: 3h 15m 22s.
[2025-11-13 09:59:53,125][__main__][INFO] - Starting iteration 314.
[2025-11-13 09:59:53,128][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 31 and human policies 1.
[2025-11-13 09:59:53,129][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:00:01,671][__main__][INFO] - Number of regex retries in iteration 314: 0
[2025-11-13 10:00:01,672][__main__][INFO] - agents played in iteration 314 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 10:00:02,111][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:00:02,145][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:00:02,178][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:00:02,212][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:00:02,213][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:00:02,213][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:00:02,957][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:00:03,255][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:00:03,583][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:00:03,910][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:00:04,237][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:00:04,564][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:00:04,893][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:00:05,223][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:00:05,552][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:00:05,879][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:00:06,206][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:00:06,534][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:00:06,859][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:00:07,186][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:00:07,518][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:00:07,847][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:00:08,174][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:00:08,501][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:00:08,831][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:00:09,157][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:00:09,489][mllm.training.trainer_common][INFO] - Processing
mini-batch 80 of 128
[2025-11-13 10:00:09,819][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:00:10,145][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:00:10,471][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:00:10,798][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:00:11,124][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:00:11,452][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:00:11,780][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:00:12,107][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:00:12,435][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:00:12,767][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:00:13,094][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:00:13,428][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:00:14,136][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:00:14,830][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:00:14,832][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:00:14,834][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:00:15,771][__main__][INFO] - Iteration 315 took 22s (37.73% Gen, 58.13% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 55m 44s. Estimated total time: 18h 52m 11s. Time estimates for 10 more iterations: 3m 46s, 100 more iterations: 37m 44s, 500 more iterations: 3h 8m 41s.
[2025-11-13 10:00:15,773][__main__][INFO] - Starting iteration 315.
[2025-11-13 10:00:15,775][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 31 and human policies 1.
[2025-11-13 10:00:15,776][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:00:24,820][__main__][INFO] - Number of regex retries in iteration 315: 0
[2025-11-13 10:00:24,821][__main__][INFO] - agents played in iteration 315 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 10:00:25,261][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:00:25,295][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:00:25,328][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:00:25,362][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:00:25,363][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:00:25,364][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:00:26,090][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:00:26,388][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:00:26,716][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:00:27,042][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:00:27,368][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:00:27,698][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:00:28,025][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:00:28,352][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:00:28,681][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:00:29,011][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:00:29,336][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:00:29,664][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:00:29,992][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:00:30,320][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:00:30,649][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:00:30,975][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:00:31,303][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:00:31,634][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:00:31,961][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:00:32,292][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:00:32,620][mllm.training.trainer_common][INFO] - Processing
mini-batch 80 of 128
[2025-11-13 10:00:32,947][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:00:33,274][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:00:33,601][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:00:33,929][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:00:34,256][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:00:34,581][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:00:34,909][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:00:35,239][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:00:35,566][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:00:35,896][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:00:36,225][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:00:36,553][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:00:37,290][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:00:37,983][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:00:37,987][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:00:37,988][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:00:38,909][__main__][INFO] - Iteration 316 took 23s (39.09% Gen, 56.92% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 19m 55s. Estimated total time: 19h 16m 45s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 33s, 500 more iterations: 3h 12m 47s.
[2025-11-13 10:00:38,911][__main__][INFO] - Starting iteration 316.
[2025-11-13 10:00:38,915][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 31 and human policies 1.
[2025-11-13 10:00:38,915][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:00:48,025][__main__][INFO] - Number of regex retries in iteration 316: 0
[2025-11-13 10:00:48,025][__main__][INFO] - agents played in iteration 316 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 10:00:48,466][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:00:48,500][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:00:48,533][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:00:48,567][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:00:48,568][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:00:48,568][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:00:49,297][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:00:49,593][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:00:49,920][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:00:50,248][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:00:50,575][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:00:50,902][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:00:51,230][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:00:51,557][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:00:51,884][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:00:52,210][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:00:52,536][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:00:52,864][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:00:53,195][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:00:53,522][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:00:53,850][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:00:54,179][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:00:54,505][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:00:54,834][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:00:55,159][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:00:55,487][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:00:55,814][mllm.training.trainer_common][INFO] - Processing
mini-batch 80 of 128
[2025-11-13 10:00:56,140][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:00:56,468][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:00:56,800][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:00:57,129][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:00:57,458][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:00:57,787][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:00:58,119][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:00:58,450][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:00:58,776][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:00:59,103][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:00:59,430][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:00:59,757][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:01:00,486][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:01:01,176][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:01:01,180][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:01:01,182][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:01:02,086][__main__][INFO] - Iteration 317 took 23s (39.31% Gen, 56.77% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 21m 23s. Estimated total time: 19h 18m 36s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 37s, 500 more iterations: 3h 13m 6s.
[2025-11-13 10:01:02,088][__main__][INFO] - Starting iteration 317.
[2025-11-13 10:01:02,091][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 31 and human policies 1.
[2025-11-13 10:01:02,091][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:01:10,911][__main__][INFO] - Number of regex retries in iteration 317: 0
[2025-11-13 10:01:10,912][__main__][INFO] - agents played in iteration 317 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 10:01:11,359][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:01:11,392][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:01:11,426][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:01:11,460][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:01:11,460][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:01:11,461][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:01:12,184][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:01:12,481][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:01:12,809][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:01:13,137][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:01:13,463][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:01:13,789][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:01:14,117][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:01:14,444][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:01:14,775][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:01:15,101][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:01:15,427][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:01:15,753][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:01:16,082][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:01:16,408][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:01:16,735][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:01:17,061][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:01:17,388][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:01:17,713][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:01:18,041][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:01:18,368][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:01:18,697][mllm.training.trainer_common][INFO] - Processing
mini-batch 80 of 128
[2025-11-13 10:01:19,025][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:01:19,354][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:01:19,686][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:01:20,015][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:01:20,343][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:01:20,671][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:01:21,002][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:01:21,332][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:01:21,658][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:01:21,985][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:01:22,313][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:01:22,641][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:01:23,356][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:01:24,052][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:01:24,055][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:01:24,056][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:01:24,972][__main__][INFO] - Iteration 318 took 22s (38.55% Gen, 57.44% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 6m 30s. Estimated total time: 19h 4m 6s. Time estimates for 10 more iterations: 3m 48s, 100 more iterations: 38m 8s, 500 more iterations: 3h 10m 41s.
[2025-11-13 10:01:24,974][__main__][INFO] - Starting iteration 318.
[2025-11-13 10:01:24,977][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 31 and human policies 1.
[2025-11-13 10:01:24,977][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:01:34,308][__main__][INFO] - Number of regex retries in iteration 318: 0
[2025-11-13 10:01:34,309][__main__][INFO] - agents played in iteration 318 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 10:01:34,753][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:01:34,786][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:01:34,820][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:01:34,854][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:01:34,854][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:01:34,854][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:01:35,589][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:01:35,899][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:01:36,227][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:01:36,556][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:01:36,881][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:01:37,208][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:01:37,535][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:01:37,862][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:01:38,189][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:01:38,515][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:01:38,844][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:01:39,171][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:01:39,498][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:01:39,824][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:01:40,153][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:01:40,479][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:01:40,805][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:01:41,131][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:01:41,459][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:01:41,786][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:01:42,113][mllm.training.trainer_common][INFO] - Processing
mini-batch 80 of 128 [2025-11-13 10:01:42,442][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:01:42,771][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:01:43,105][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:01:43,435][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:01:43,766][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:01:44,098][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:01:44,429][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:01:44,761][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:01:45,088][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:01:45,414][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:01:45,741][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:01:46,068][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 10:01:46,784][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:01:47,476][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:01:47,478][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:01:47,479][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:01:48,368][__main__][INFO] - Iteration 319 took 23s (39.89% Gen, 56.30% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 31m 37s. Estimated total time: 19h 29m 36s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 59s, 500 more iterations: 3h 14m 56s.
[2025-11-13 10:01:48,370][__main__][INFO] - Starting iteration 319.
[2025-11-13 10:01:48,375][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 31 and human policies 1.
[2025-11-13 10:01:48,376][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:01:57,913][__main__][INFO] - Number of regex retries in iteration 319: 0
[2025-11-13 10:01:57,913][__main__][INFO] - agents played in iteration 319 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 10:01:58,355][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:01:58,390][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:01:58,423][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:01:58,456][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:01:58,457][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:01:58,457][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:01:59,181][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:01:59,478][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:01:59,805][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:02:00,133][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:02:00,461][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:02:00,787][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:02:01,114][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:02:01,440][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:02:01,768][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:02:02,094][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:02:02,422][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:02:02,753][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:02:03,085][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:02:03,410][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:02:03,737][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:02:04,065][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:02:04,394][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:02:04,722][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:02:05,050][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:02:05,382][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:02:05,712][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:02:06,038][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:02:06,365][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:02:06,697][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:02:07,026][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:02:07,358][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:02:07,684][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:02:08,010][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:02:08,337][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:02:08,664][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:02:08,991][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:02:09,319][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:02:09,648][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:02:10,361][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:02:11,054][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:02:11,058][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:02:11,059][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:02:12,026][__main__][INFO] - Iteration 320 took 23s (40.32% Gen, 55.58% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 44m 19s. Estimated total time: 19h 42m 42s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 25s, 500 more iterations: 3h 17m 7s.
[2025-11-13 10:02:12,028][__main__][INFO] - Starting iteration 320.
[2025-11-13 10:02:12,031][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 31 and human policies 1.
[2025-11-13 10:02:12,031][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:02:20,868][__main__][INFO] - Number of regex retries in iteration 320: 0
[2025-11-13 10:02:20,869][__main__][INFO] - agents played in iteration 320 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 10:02:21,311][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:02:21,345][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:02:21,379][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:02:21,413][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:02:21,413][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:02:21,413][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:02:22,141][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:02:22,438][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:02:22,767][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:02:23,094][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:02:23,421][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:02:23,747][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:02:24,074][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:02:24,401][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:02:24,727][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:02:25,055][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:02:25,381][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:02:25,708][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:02:26,034][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:02:26,363][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:02:26,694][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:02:27,021][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:02:27,348][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:02:27,675][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:02:28,001][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:02:28,327][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:02:28,655][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:02:28,981][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:02:29,309][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:02:29,636][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:02:29,968][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:02:30,300][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:02:30,629][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:02:30,957][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:02:31,285][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:02:31,611][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:02:31,938][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:02:32,267][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:02:32,596][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:02:33,306][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:02:34,072][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:02:34,073][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:02:34,075][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:02:35,812][__main__][INFO] - Iteration 321 took 23s (37.16% Gen, 55.53% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 50m 18s. Estimated total time: 19h 49m 5s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 38s, 500 more iterations: 3h 18m 10s.
[2025-11-13 10:02:35,814][__main__][INFO] - Starting iteration 321.
[2025-11-13 10:02:35,817][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 32 and human policies 1.
[2025-11-13 10:02:35,818][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:02:44,819][__main__][INFO] - Number of regex retries in iteration 321: 0
[2025-11-13 10:02:44,819][__main__][INFO] - agents played in iteration 321 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 10:02:45,264][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:02:45,297][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:02:45,331][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:02:45,365][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:02:45,366][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:02:45,367][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:02:46,109][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:02:46,407][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:02:46,733][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:02:47,060][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:02:47,387][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:02:47,713][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:02:48,040][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:02:48,369][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:02:48,696][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:02:49,022][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:02:49,348][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:02:49,675][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:02:50,002][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:02:50,331][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:02:50,657][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:02:50,985][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:02:51,313][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:02:51,639][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:02:51,966][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:02:52,293][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:02:52,619][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:02:52,948][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:02:53,278][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:02:53,611][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:02:53,942][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:02:54,268][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:02:54,597][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:02:54,926][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:02:55,253][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:02:55,580][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:02:55,907][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:02:56,233][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:02:56,561][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:02:57,274][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:02:57,960][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:02:57,969][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:02:57,970][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:02:58,978][__main__][INFO] - Iteration 322 took 23s (38.86% Gen, 56.78% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 18m 57s. Estimated total time: 19h 18m 7s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 36s, 500 more iterations: 3h 13m 1s.
[2025-11-13 10:02:58,980][__main__][INFO] - Starting iteration 322.
[2025-11-13 10:02:58,983][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 32 and human policies 1.
[2025-11-13 10:02:58,984][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:03:07,869][__main__][INFO] - Number of regex retries in iteration 322: 0
[2025-11-13 10:03:07,870][__main__][INFO] - agents played in iteration 322 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 10:03:08,310][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:03:08,344][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:03:08,377][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:03:08,411][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:03:08,412][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:03:08,412][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:03:09,134][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:03:09,432][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:03:09,759][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:03:10,086][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:03:10,412][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:03:10,740][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:03:11,065][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:03:11,391][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:03:11,720][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:03:12,047][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:03:12,374][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:03:12,700][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:03:13,028][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:03:13,354][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:03:13,682][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:03:14,007][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:03:14,334][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:03:14,663][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:03:14,988][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:03:15,315][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:03:15,643][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:03:15,971][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:03:16,299][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:03:16,626][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:03:16,952][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:03:17,280][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:03:17,608][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:03:17,937][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:03:18,264][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:03:18,590][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:03:18,916][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:03:19,242][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:03:19,571][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:03:20,282][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:03:20,970][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:03:20,973][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:03:20,975][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:03:21,860][__main__][INFO] - Iteration 323 took 22s (38.84% Gen, 57.28% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 4m 22s. Estimated total time: 19h 3m 54s. Time estimates for 10 more iterations: 3m 48s, 100 more iterations: 38m 7s, 500 more iterations: 3h 10m 39s.
[2025-11-13 10:03:21,862][__main__][INFO] - Starting iteration 323.
[2025-11-13 10:03:21,866][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 32 and human policies 1.
[2025-11-13 10:03:21,866][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:03:30,268][__main__][INFO] - Number of regex retries in iteration 323: 0
[2025-11-13 10:03:30,268][__main__][INFO] - agents played in iteration 323 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 10:03:30,714][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:03:30,750][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:03:30,783][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:03:30,817][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:03:30,818][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:03:30,818][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:03:31,562][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:03:31,858][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:03:32,184][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:03:32,510][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:03:32,838][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:03:33,165][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:03:33,492][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:03:33,821][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:03:34,148][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:03:34,475][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:03:34,802][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:03:35,130][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:03:35,458][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:03:35,785][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:03:36,113][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:03:36,440][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:03:36,768][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:03:37,095][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:03:37,424][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:03:37,750][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:03:38,076][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:03:38,403][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:03:38,730][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:03:39,056][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:03:39,386][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:03:39,713][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:03:40,039][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:03:40,372][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:03:40,700][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:03:41,026][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:03:41,354][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:03:41,682][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:03:42,010][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:03:42,731][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:03:43,426][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:03:43,429][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:03:43,431][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:03:44,322][__main__][INFO] - Iteration 324 took 22s (37.41% Gen, 58.61% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 42m 56s. Estimated total time: 18h 42m 51s. Time estimates for 10 more iterations: 3m 44s, 100 more iterations: 37m 25s, 500 more iterations: 3h 7m 8s.
[2025-11-13 10:03:44,324][__main__][INFO] - Starting iteration 324.
[2025-11-13 10:03:44,327][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 32 and human policies 1.
[2025-11-13 10:03:44,327][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:03:53,994][__main__][INFO] - Number of regex retries in iteration 324: 0 [2025-11-13 10:03:53,995][__main__][INFO] - agents played in iteration 324 are Alice_buffer, Alice, Bob, Bob_buffer [2025-11-13 10:03:54,439][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:03:54,473][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:03:54,506][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:03:54,541][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:03:54,542][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:03:54,542][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:03:55,279][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:03:55,577][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:03:55,905][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:03:56,231][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:03:56,558][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:03:56,885][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:03:57,214][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:03:57,540][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:03:57,868][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:03:58,195][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:03:58,522][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:03:58,849][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:03:59,175][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:03:59,503][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:03:59,830][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:04:00,157][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:04:00,483][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:04:00,810][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:04:01,138][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:04:01,465][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:04:01,792][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 10:04:02,119][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:04:02,452][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:04:02,780][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:04:03,107][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:04:03,432][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:04:03,760][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:04:04,088][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:04:04,415][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:04:04,742][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:04:05,070][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:04:05,397][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:04:05,726][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 10:04:06,443][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:04:07,134][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:04:07,136][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:04:07,137][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:04:08,273][__main__][INFO] - Iteration 325 took 23s (40.37% Gen, 54.88% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 57m 3s. Estimated total time: 19h 57m 22s. Time estimates for 10 more iterations: 3m 59s, 100 more iterations: 39m 54s, 500 more iterations: 3h 19m 33s. [2025-11-13 10:04:08,275][__main__][INFO] - Starting iteration 325. [2025-11-13 10:04:08,278][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 32 and human policies 1. 
[2025-11-13 10:04:08,279][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:04:17,387][__main__][INFO] - Number of regex retries in iteration 325: 0 [2025-11-13 10:04:17,387][__main__][INFO] - agents played in iteration 325 are Alice_buffer, Alice, Bob, Bob_buffer [2025-11-13 10:04:17,840][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:04:17,875][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:04:17,908][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:04:17,943][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:04:17,943][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:04:17,944][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:04:18,676][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:04:18,974][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:04:19,301][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:04:19,628][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:04:19,955][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:04:20,281][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:04:20,607][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:04:20,934][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:04:21,261][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:04:21,587][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:04:21,915][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:04:22,242][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:04:22,570][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:04:22,897][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:04:23,224][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:04:23,551][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:04:23,878][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:04:24,204][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:04:24,536][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:04:24,865][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:04:25,191][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 10:04:25,520][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:04:25,853][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:04:26,182][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:04:26,510][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:04:26,841][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:04:27,168][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:04:27,496][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:04:27,823][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:04:28,151][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:04:28,478][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:04:28,811][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:04:29,140][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 10:04:29,860][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:04:30,546][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:04:30,547][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:04:30,549][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:04:31,513][__main__][INFO] - Iteration 326 took 23s (39.20% Gen, 56.64% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 21m 3s. Estimated total time: 19h 21m 46s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 43s, 500 more iterations: 3h 13m 37s. [2025-11-13 10:04:31,515][__main__][INFO] - Starting iteration 326. [2025-11-13 10:04:31,518][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 32 and human policies 1. 
[2025-11-13 10:04:31,519][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:04:40,441][__main__][INFO] - Number of regex retries in iteration 326: 0 [2025-11-13 10:04:40,442][__main__][INFO] - agents played in iteration 326 are Alice_buffer, Alice, Bob, Bob_buffer [2025-11-13 10:04:40,884][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:04:40,921][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:04:40,955][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:04:40,989][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:04:40,989][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:04:40,990][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:04:41,717][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:04:42,015][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:04:42,342][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:04:42,668][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:04:42,995][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:04:43,321][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:04:43,648][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:04:43,974][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:04:44,303][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:04:44,628][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:04:44,954][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:04:45,280][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:04:45,608][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:04:45,935][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:04:46,262][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:04:46,588][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:04:46,915][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:04:47,241][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:04:47,570][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:04:47,897][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:04:48,223][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 10:04:48,550][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:04:48,876][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:04:49,205][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:04:49,539][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:04:49,868][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:04:50,195][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:04:50,521][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:04:50,849][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:04:51,176][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:04:51,503][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:04:51,833][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:04:52,161][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 10:04:52,904][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:04:53,605][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:04:53,607][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:04:53,609][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:04:54,544][__main__][INFO] - Iteration 327 took 23s (38.75% Gen, 57.18% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 10m 16s. Estimated total time: 19h 11m 21s. Time estimates for 10 more iterations: 3m 50s, 100 more iterations: 38m 22s, 500 more iterations: 3h 11m 53s. [2025-11-13 10:04:54,546][__main__][INFO] - Starting iteration 327. [2025-11-13 10:04:54,549][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 32 and human policies 1. 
[2025-11-13 10:04:54,550][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:05:03,533][__main__][INFO] - Number of regex retries in iteration 327: 0 [2025-11-13 10:05:03,533][__main__][INFO] - agents played in iteration 327 are Alice_buffer, Alice, Bob, Bob_buffer [2025-11-13 10:05:03,974][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:05:04,007][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:05:04,042][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:05:04,077][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:05:04,077][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:05:04,078][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:05:04,810][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:05:05,107][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:05:05,434][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:05:05,762][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:05:06,088][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:05:06,414][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:05:06,740][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:05:07,066][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:05:07,393][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:05:07,719][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:05:08,045][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:05:08,372][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:05:08,698][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:05:09,024][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:05:09,350][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:05:09,677][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:05:10,004][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:05:10,331][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:05:10,659][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:05:10,985][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:05:11,312][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 10:05:11,642][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:05:11,973][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:05:12,301][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:05:12,629][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:05:12,958][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:05:13,286][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:05:13,612][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:05:13,939][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:05:14,266][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:05:14,592][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:05:14,921][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:05:15,248][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 10:05:15,975][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:05:16,663][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:05:16,665][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:05:16,666][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:05:17,582][__main__][INFO] - Iteration 328 took 23s (39.00% Gen, 57.02% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 10m 13s. Estimated total time: 19h 11m 41s. Time estimates for 10 more iterations: 3m 50s, 100 more iterations: 38m 23s, 500 more iterations: 3h 11m 56s. [2025-11-13 10:05:17,585][__main__][INFO] - Starting iteration 328. [2025-11-13 10:05:17,588][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 32 and human policies 1. 
[2025-11-13 10:05:17,588][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:05:23,193][mllm.models.large_language_model_local][WARNING] - Response did not match regex: (|), retry 1/1 [2025-11-13 10:05:26,597][__main__][INFO] - Number of regex retries in iteration 328: 1 [2025-11-13 10:05:26,597][__main__][INFO] - agents played in iteration 328 are Alice_buffer, Alice, Bob, Bob_buffer [2025-11-13 10:05:27,039][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:05:27,404][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:05:27,438][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:05:27,473][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:05:27,473][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:05:27,473][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:05:28,200][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:05:28,497][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:05:28,825][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:05:29,152][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:05:29,479][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:05:29,805][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:05:30,132][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:05:30,459][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:05:30,786][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:05:31,113][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:05:31,439][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:05:31,767][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:05:32,093][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:05:32,420][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:05:32,746][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:05:33,073][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:05:33,400][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:05:33,728][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:05:34,056][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:05:34,383][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:05:34,710][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 10:05:35,037][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:05:35,366][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:05:35,693][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:05:36,020][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:05:36,348][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:05:36,675][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:05:37,002][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:05:37,331][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:05:37,658][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:05:37,985][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:05:38,313][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:05:38,641][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 10:05:39,357][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:05:40,047][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:05:40,048][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:05:40,050][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:05:40,950][__main__][INFO] - Iteration 329 took 23s (38.56% Gen, 57.58% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 26m 17s. Estimated total time: 19h 28m 8s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 56s, 500 more iterations: 3h 14m 41s. [2025-11-13 10:05:40,952][__main__][INFO] - Starting iteration 329. [2025-11-13 10:05:40,955][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 32 and human policies 1. 
[2025-11-13 10:05:40,956][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:05:50,206][__main__][INFO] - Number of regex retries in iteration 329: 0 [2025-11-13 10:05:50,206][__main__][INFO] - agents played in iteration 329 are Alice_buffer, Alice, Bob, Bob_buffer [2025-11-13 10:05:50,646][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:05:50,683][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:05:50,717][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:05:50,751][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:05:50,751][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:05:50,751][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:05:51,486][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:05:51,784][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:05:52,112][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:05:52,439][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:05:52,766][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:05:53,093][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:05:53,419][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:05:53,745][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:05:54,072][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:05:54,399][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:05:54,725][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:05:55,051][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:05:55,377][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:05:55,703][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:05:56,030][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:05:56,357][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:05:56,684][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:05:57,011][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:05:57,339][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:05:57,665][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:05:57,994][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:05:58,321][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:05:58,653][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:05:58,985][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:05:59,313][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:05:59,640][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:05:59,967][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:06:00,293][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:06:00,620][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:06:00,949][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:06:01,275][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:06:01,603][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:06:01,932][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:06:02,687][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:06:03,403][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:06:03,404][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:06:03,406][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:06:04,379][__main__][INFO] - Iteration 330 took 23s (39.49% Gen, 56.35% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 28m 58s. Estimated total time: 19h 31m 13s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 2s, 500 more iterations: 3h 15m 12s.
[2025-11-13 10:06:04,381][__main__][INFO] - Starting iteration 330.
[2025-11-13 10:06:04,384][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 32 and human policies 1.
[2025-11-13 10:06:04,385][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:06:13,627][__main__][INFO] - Number of regex retries in iteration 330: 0
[2025-11-13 10:06:13,627][__main__][INFO] - agents played in iteration 330 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 10:06:14,076][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:06:14,110][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:06:14,144][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:06:14,178][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:06:14,178][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:06:14,179][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:06:14,903][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:06:15,199][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:06:15,532][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:06:15,860][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:06:16,189][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:06:16,518][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:06:16,845][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:06:17,171][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:06:17,500][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:06:17,827][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:06:18,154][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:06:18,481][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:06:18,809][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:06:19,135][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:06:19,462][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:06:19,788][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:06:20,117][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:06:20,443][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:06:20,770][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:06:21,096][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:06:21,423][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:06:21,751][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:06:22,078][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:06:22,406][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:06:22,736][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:06:23,064][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:06:23,395][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:06:23,728][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:06:24,055][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:06:24,381][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:06:24,708][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:06:25,035][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:06:25,367][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:06:26,105][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:06:26,794][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:06:26,795][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:06:26,797][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:06:28,609][__main__][INFO] - Iteration 331 took 24s (38.15% Gen, 54.36% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 8m 38s. Estimated total time: 20h 11m 18s. Time estimates for 10 more iterations: 4m 2s, 100 more iterations: 40m 22s, 500 more iterations: 3h 21m 53s.
[2025-11-13 10:06:28,611][__main__][INFO] - Starting iteration 331.
[2025-11-13 10:06:28,615][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 33 and human policies 1.
[2025-11-13 10:06:28,616][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:06:38,111][__main__][INFO] - Number of regex retries in iteration 331: 0
[2025-11-13 10:06:38,111][__main__][INFO] - agents played in iteration 331 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 10:06:38,560][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:06:38,596][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:06:38,630][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:06:38,664][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:06:38,665][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:06:38,665][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:06:39,393][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:06:39,691][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:06:40,019][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:06:40,345][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:06:40,671][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:06:40,998][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:06:41,325][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:06:41,651][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:06:41,978][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:06:42,305][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:06:42,633][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:06:42,959][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:06:43,285][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:06:43,612][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:06:43,939][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:06:44,268][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:06:44,594][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:06:44,921][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:06:45,248][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:06:45,575][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:06:45,902][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:06:46,234][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:06:46,566][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:06:46,892][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:06:47,219][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:06:47,548][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:06:47,875][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:06:48,205][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:06:48,532][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:06:48,860][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:06:49,187][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:06:49,514][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:06:49,842][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:06:50,578][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:06:51,273][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:06:51,275][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:06:51,277][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:06:52,444][__main__][INFO] - Iteration 332 took 23s (39.85% Gen, 55.25% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 48m 26s. Estimated total time: 19h 51m 30s. Time estimates for 10 more iterations: 3m 58s, 100 more iterations: 39m 43s, 500 more iterations: 3h 18m 35s.
[2025-11-13 10:06:52,447][__main__][INFO] - Starting iteration 332.
[2025-11-13 10:06:52,450][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 33 and human policies 1.
[2025-11-13 10:06:52,450][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:07:01,431][__main__][INFO] - Number of regex retries in iteration 332: 0
[2025-11-13 10:07:01,432][__main__][INFO] - agents played in iteration 332 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 10:07:01,873][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:07:01,907][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:07:01,941][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:07:01,975][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:07:01,976][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:07:01,976][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:07:02,704][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:07:03,001][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:07:03,328][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:07:03,660][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:07:03,987][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:07:04,314][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:07:04,641][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:07:04,968][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:07:05,295][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:07:05,622][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:07:05,949][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:07:06,276][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:07:06,603][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:07:06,929][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:07:07,256][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:07:07,585][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:07:07,912][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:07:08,240][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:07:08,567][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:07:08,895][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:07:09,228][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:07:09,557][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:07:09,890][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:07:10,217][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:07:10,545][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:07:10,872][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:07:11,198][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:07:11,527][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:07:11,853][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:07:12,179][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:07:12,507][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:07:12,835][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:07:13,163][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:07:13,913][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:07:14,618][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:07:14,625][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:07:14,626][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:07:15,596][__main__][INFO] - Iteration 333 took 23s (38.80% Gen, 57.00% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 13m 56s. Estimated total time: 19h 17m 22s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 34s, 500 more iterations: 3h 12m 53s.
[2025-11-13 10:07:15,599][__main__][INFO] - Starting iteration 333.
[2025-11-13 10:07:15,602][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 33 and human policies 1.
[2025-11-13 10:07:15,603][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:07:24,214][__main__][INFO] - Number of regex retries in iteration 333: 0
[2025-11-13 10:07:24,215][__main__][INFO] - agents played in iteration 333 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 10:07:24,662][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:07:24,695][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:07:24,728][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:07:24,762][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:07:24,763][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:07:24,764][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:07:25,510][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:07:25,808][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:07:26,138][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:07:26,467][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:07:26,795][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:07:27,122][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:07:27,448][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:07:27,775][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:07:28,100][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:07:28,427][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:07:28,755][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:07:29,082][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:07:29,409][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:07:29,736][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:07:30,063][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:07:30,391][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:07:30,719][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:07:31,047][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:07:31,374][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:07:31,701][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:07:32,028][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:07:32,355][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:07:32,682][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:07:33,010][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:07:33,337][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:07:33,666][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:07:33,994][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:07:34,320][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:07:34,647][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:07:34,975][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:07:35,302][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:07:35,630][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:07:35,958][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:07:36,700][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:07:37,402][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:07:37,404][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:07:37,406][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:07:38,380][__main__][INFO] - Iteration 334 took 22s (37.81% Gen, 57.91% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 55m 5s. Estimated total time: 18h 58m 54s. Time estimates for 10 more iterations: 3m 47s, 100 more iterations: 37m 57s, 500 more iterations: 3h 9m 49s.
[2025-11-13 10:07:38,382][__main__][INFO] - Starting iteration 334.
[2025-11-13 10:07:38,386][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 33 and human policies 1.
[2025-11-13 10:07:38,386][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:07:47,466][__main__][INFO] - Number of regex retries in iteration 334: 0
[2025-11-13 10:07:47,467][__main__][INFO] - agents played in iteration 334 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 10:07:47,896][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:07:47,930][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:07:47,964][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:07:47,998][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:07:47,998][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:07:47,999][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:07:48,732][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:07:49,031][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:07:49,358][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:07:49,688][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:07:50,016][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:07:50,344][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:07:50,670][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:07:50,996][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:07:51,323][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:07:51,650][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:07:51,975][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:07:52,302][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:07:52,628][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:07:52,955][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:07:53,283][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:07:53,610][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:07:53,937][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:07:54,263][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:07:54,590][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:07:54,918][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:07:55,245][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:07:55,572][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:07:55,900][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:07:56,226][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:07:56,554][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:07:56,881][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:07:57,208][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:07:57,534][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:07:57,863][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:07:58,191][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:07:58,518][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:07:58,843][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:07:59,170][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:07:59,904][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:08:00,595][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:08:00,598][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:08:00,599][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:08:01,558][__main__][INFO] - Iteration 335 took 23s (39.18% Gen, 56.67% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 14m 27s. Estimated total time: 19h 18m 39s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 37s, 500 more iterations: 3h 13m 6s.
[2025-11-13 10:08:01,560][__main__][INFO] - Starting iteration 335.
[2025-11-13 10:08:01,563][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 33 and human policies 1.
[2025-11-13 10:08:01,564][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:08:10,231][__main__][INFO] - Number of regex retries in iteration 335: 0 [2025-11-13 10:08:10,231][__main__][INFO] - agents played in iteration 335 are Alice_buffer, Alice, Bob, Bob_buffer [2025-11-13 10:08:10,659][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:08:10,695][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:08:10,729][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:08:10,762][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:08:10,762][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:08:10,762][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:08:11,454][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:08:11,751][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:08:12,079][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:08:12,405][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:08:12,733][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:08:13,061][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:08:13,391][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:08:13,718][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:08:14,044][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:08:14,370][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:08:14,697][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:08:15,024][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:08:15,351][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:08:15,677][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:08:16,004][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:08:16,330][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:08:16,657][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:08:16,985][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:08:17,311][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:08:17,638][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:08:17,965][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 10:08:18,292][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:08:18,619][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:08:18,946][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:08:19,272][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:08:19,599][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:08:19,925][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:08:20,253][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:08:20,580][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:08:20,907][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:08:21,235][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:08:21,561][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:08:21,890][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 10:08:22,631][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:08:23,492][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:08:23,494][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:08:23,508][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:08:24,737][__main__][INFO] - Iteration 336 took 23s (37.40% Gen, 57.29% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 14m 7s. Estimated total time: 19h 18m 43s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 37s, 500 more iterations: 3h 13m 7s.
[2025-11-13 10:08:24,739][__main__][INFO] - Starting iteration 336.
[2025-11-13 10:08:24,742][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 33 and human policies 1.
[2025-11-13 10:08:24,742][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:08:33,697][__main__][INFO] - Number of regex retries in iteration 336: 0
[2025-11-13 10:08:33,698][__main__][INFO] - agents played in iteration 336 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 10:08:34,129][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:08:34,165][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:08:34,198][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:08:34,231][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:08:34,231][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:08:34,232][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:08:34,964][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:08:35,263][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:08:35,594][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:08:35,922][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:08:36,256][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:08:36,584][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:08:36,916][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:08:37,246][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:08:37,575][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:08:37,902][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:08:38,229][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:08:38,556][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:08:38,882][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:08:39,211][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:08:39,538][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:08:39,865][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:08:40,193][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:08:40,520][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:08:40,847][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:08:41,174][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:08:41,501][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:08:41,828][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:08:42,158][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:08:42,484][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:08:42,812][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:08:43,138][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:08:43,465][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:08:43,792][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:08:44,118][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:08:44,445][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:08:44,773][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:08:45,101][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:08:45,430][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:08:46,183][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:08:46,888][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:08:46,890][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:08:46,891][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:08:47,820][__main__][INFO] - Iteration 337 took 23s (38.80% Gen, 57.17% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 8m 57s. Estimated total time: 19h 13m 56s. Time estimates for 10 more iterations: 3m 50s, 100 more iterations: 38m 27s, 500 more iterations: 3h 12m 19s.
[2025-11-13 10:08:47,822][__main__][INFO] - Starting iteration 337.
[2025-11-13 10:08:47,826][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 33 and human policies 1.
[2025-11-13 10:08:47,827][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:08:56,315][__main__][INFO] - Number of regex retries in iteration 337: 0
[2025-11-13 10:08:56,315][__main__][INFO] - agents played in iteration 337 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 10:08:56,742][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:08:56,775][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:08:56,808][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:08:56,841][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:08:56,842][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:08:56,842][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:08:57,541][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:08:57,838][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:08:58,165][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:08:58,493][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:08:58,821][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:08:59,149][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:08:59,476][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:08:59,804][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:09:00,132][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:09:00,461][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:09:00,788][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:09:01,116][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:09:01,443][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:09:01,768][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:09:02,095][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:09:02,421][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:09:02,748][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:09:03,075][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:09:03,402][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:09:03,730][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:09:04,058][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:09:04,384][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:09:04,712][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:09:05,039][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:09:05,367][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:09:05,694][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:09:06,021][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:09:06,347][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:09:06,675][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:09:07,003][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:09:07,330][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:09:07,656][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:09:07,984][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:09:08,701][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:09:09,390][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:09:09,392][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:09:09,394][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:09:10,302][__main__][INFO] - Iteration 338 took 22s (37.76% Gen, 58.18% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 38m 31s. Estimated total time: 18h 43m 52s. Time estimates for 10 more iterations: 3m 44s, 100 more iterations: 37m 27s, 500 more iterations: 3h 7m 18s.
[2025-11-13 10:09:10,304][__main__][INFO] - Starting iteration 338.
[2025-11-13 10:09:10,307][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 33 and human policies 1.
[2025-11-13 10:09:10,308][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:09:19,606][__main__][INFO] - Number of regex retries in iteration 338: 0
[2025-11-13 10:09:19,607][__main__][INFO] - agents played in iteration 338 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 10:09:20,038][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:09:20,074][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:09:20,107][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:09:20,140][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:09:20,141][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:09:20,141][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:09:20,842][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:09:21,139][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:09:21,468][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:09:21,796][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:09:22,125][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:09:22,454][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:09:22,784][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:09:23,111][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:09:23,440][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:09:23,769][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:09:24,100][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:09:24,428][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:09:24,757][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:09:25,084][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:09:25,412][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:09:25,739][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:09:26,065][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:09:26,392][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:09:26,719][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:09:27,045][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:09:27,372][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:09:27,699][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:09:28,026][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:09:28,352][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:09:28,679][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:09:29,007][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:09:29,335][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:09:29,662][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:09:29,989][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:09:30,315][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:09:30,641][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:09:30,968][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:09:31,297][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:09:32,009][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:09:32,719][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:09:32,720][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:09:32,722][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:09:33,914][__main__][INFO] - Iteration 339 took 23s (39.39% Gen, 55.55% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 34m 37s. Estimated total time: 19h 40m 22s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 20s, 500 more iterations: 3h 16m 43s.
[2025-11-13 10:09:33,916][__main__][INFO] - Starting iteration 339.
[2025-11-13 10:09:33,918][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 33 and human policies 1.
[2025-11-13 10:09:33,918][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:09:42,448][__main__][INFO] - Number of regex retries in iteration 339: 0
[2025-11-13 10:09:42,449][__main__][INFO] - agents played in iteration 339 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 10:09:42,880][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:09:43,235][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:09:43,267][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:09:43,300][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:09:43,300][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:09:43,300][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:09:43,980][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:09:44,277][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:09:44,602][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:09:44,930][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:09:45,256][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:09:45,584][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:09:45,911][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:09:46,239][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:09:46,570][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:09:46,898][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:09:47,225][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:09:47,551][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:09:47,883][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:09:48,211][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:09:48,538][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:09:48,865][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:09:49,191][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:09:49,517][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:09:49,844][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:09:50,172][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:09:50,498][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:09:50,825][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:09:51,153][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:09:51,480][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:09:51,807][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:09:52,135][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:09:52,462][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:09:52,790][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:09:53,118][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:09:53,445][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:09:53,772][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:09:54,100][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:09:54,427][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:09:55,163][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:09:55,849][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:09:55,850][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:09:55,852][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:09:56,783][__main__][INFO] - Iteration 340 took 22s (37.31% Gen, 58.62% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 57m 8s. Estimated total time: 19h 3m 15s. Time estimates for 10 more iterations: 3m 48s, 100 more iterations: 38m 6s, 500 more iterations: 3h 10m 32s.
[2025-11-13 10:09:56,785][__main__][INFO] - Starting iteration 340.
[2025-11-13 10:09:56,788][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 33 and human policies 1.
[2025-11-13 10:09:56,788][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:10:05,127][__main__][INFO] - Number of regex retries in iteration 340: 0
[2025-11-13 10:10:05,128][__main__][INFO] - agents played in iteration 340 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 10:10:05,566][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:10:05,919][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:10:05,952][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:10:05,985][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:10:05,985][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:10:05,986][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:10:06,671][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:10:06,968][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:10:07,295][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:10:07,623][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:10:07,948][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:10:08,274][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:10:08,600][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:10:08,927][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:10:09,255][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:10:09,583][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:10:09,911][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:10:10,237][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:10:10,564][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:10:10,896][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:10:11,222][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:10:11,552][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:10:11,881][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:10:12,207][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:10:12,534][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:10:12,861][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:10:13,187][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:10:13,514][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:10:13,842][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:10:14,169][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:10:14,495][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:10:14,822][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:10:15,149][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:10:15,476][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:10:15,803][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:10:16,130][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:10:16,458][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:10:16,785][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:10:17,113][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:10:17,831][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:10:18,514][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:10:18,517][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:10:18,520][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:10:20,271][__main__][INFO] - Iteration 341 took 23s (35.51% Gen, 57.03% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 27m 42s. Estimated total time: 19h 34m 13s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 8s, 500 more iterations: 3h 15m 42s.
[2025-11-13 10:10:20,274][__main__][INFO] - Starting iteration 341.
[2025-11-13 10:10:20,277][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 34 and human policies 1.
[2025-11-13 10:10:20,277][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:10:29,781][__main__][INFO] - Number of regex retries in iteration 341: 0
[2025-11-13 10:10:29,782][__main__][INFO] - agents played in iteration 341 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 10:10:30,216][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:10:30,252][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:10:30,285][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:10:30,318][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:10:30,319][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:10:30,319][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:10:31,005][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:10:31,300][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:10:31,626][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:10:31,953][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:10:32,280][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:10:32,608][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:10:32,934][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:10:33,262][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:10:33,592][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:10:33,920][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:10:34,246][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:10:34,577][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:10:34,905][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:10:35,232][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:10:35,560][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:10:35,887][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:10:36,214][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:10:36,540][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:10:36,867][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:10:37,194][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:10:37,521][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:10:37,848][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:10:38,176][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:10:38,502][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:10:38,831][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:10:39,159][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:10:39,488][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:10:39,816][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:10:40,143][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:10:40,473][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:10:40,801][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:10:41,128][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:10:41,456][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:10:42,174][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:10:42,863][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:10:42,864][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:10:42,866][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:10:43,773][__main__][INFO] - Iteration 342 took 23s (40.45% Gen, 55.69% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 27m 55s. Estimated total time: 19h 34m 50s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 9s, 500 more iterations: 3h 15m 48s.
[2025-11-13 10:10:43,775][__main__][INFO] - Starting iteration 342.
[2025-11-13 10:10:43,778][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 34 and human policies 1.
[2025-11-13 10:10:43,778][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:10:52,710][__main__][INFO] - Number of regex retries in iteration 342: 0
[2025-11-13 10:10:52,711][__main__][INFO] - agents played in iteration 342 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 10:10:53,135][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:10:53,169][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:10:53,201][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:10:53,234][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:10:53,235][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:10:53,235][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:10:53,916][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:10:54,212][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:10:54,538][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:10:54,864][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:10:55,190][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:10:55,516][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:10:55,843][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:10:56,170][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:10:56,496][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:10:56,822][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:10:57,150][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:10:57,477][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:10:57,804][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:10:58,132][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:10:58,460][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:10:58,786][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:10:59,113][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:10:59,441][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:10:59,769][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:11:00,096][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:11:00,424][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:11:00,751][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:11:01,077][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:11:01,404][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:11:01,730][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:11:02,058][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:11:02,385][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:11:02,712][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:11:03,039][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:11:03,366][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:11:03,693][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:11:04,020][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:11:04,346][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:11:05,073][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:11:05,762][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:11:05,764][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:11:05,766][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:11:06,672][__main__][INFO] - Iteration 343 took 22s (39.01% Gen, 57.02% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 57m 27s. Estimated total time: 19h 4m 44s. Time estimates for 10 more iterations: 3m 48s, 100 more iterations: 38m 9s, 500 more iterations: 3h 10m 47s.
[2025-11-13 10:11:06,674][__main__][INFO] - Starting iteration 343.
[2025-11-13 10:11:06,677][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 34 and human policies 1.
[2025-11-13 10:11:06,678][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:11:15,261][__main__][INFO] - Number of regex retries in iteration 343: 0
[2025-11-13 10:11:15,262][__main__][INFO] - agents played in iteration 343 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 10:11:15,715][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:11:15,762][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:11:15,798][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:11:15,831][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:11:15,832][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:11:15,833][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:11:16,826][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:11:17,123][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:11:17,450][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:11:17,776][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:11:18,102][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:11:18,429][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:11:18,755][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:11:19,082][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:11:19,409][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:11:19,735][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:11:20,060][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:11:20,387][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:11:20,712][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:11:21,041][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:11:21,367][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:11:21,693][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:11:22,019][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:11:22,347][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:11:22,675][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:11:23,001][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:11:23,330][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:11:23,656][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:11:23,984][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:11:24,310][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:11:24,637][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:11:24,964][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:11:25,292][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:11:25,619][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:11:25,947][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:11:26,274][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:11:26,601][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:11:26,929][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:11:27,258][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:11:27,978][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:11:28,664][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:11:28,666][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:11:28,668][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:11:29,555][__main__][INFO] - Iteration 344 took 22s (37.52% Gen, 58.59% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 56m 15s. Estimated total time: 19h 3m 55s. Time estimates for 10 more iterations: 3m 48s, 100 more iterations: 38m 7s, 500 more iterations: 3h 10m 39s.
[2025-11-13 10:11:29,557][__main__][INFO] - Starting iteration 344.
[2025-11-13 10:11:29,560][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 34 and human policies 1.
[2025-11-13 10:11:29,561][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:11:39,290][__main__][INFO] - Number of regex retries in iteration 344: 0
[2025-11-13 10:11:39,290][__main__][INFO] - agents played in iteration 344 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 10:11:39,712][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:11:39,746][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:11:39,778][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:11:39,812][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:11:39,812][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:11:39,812][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:11:40,493][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:11:40,789][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:11:41,118][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:11:41,445][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:11:41,771][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:11:42,097][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:11:42,429][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:11:42,754][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:11:43,083][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:11:43,410][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:11:43,738][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:11:44,065][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:11:44,396][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:11:44,723][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:11:45,051][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:11:45,377][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:11:45,706][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:11:46,038][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:11:46,364][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:11:46,693][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:11:47,020][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:11:47,347][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:11:47,674][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:11:48,001][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:11:48,328][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:11:48,655][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:11:48,981][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:11:49,309][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:11:49,636][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:11:49,963][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:11:50,291][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:11:50,620][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:11:50,947][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:11:51,670][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:11:52,352][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:11:52,353][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:11:52,355][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:11:53,307][__main__][INFO] - Iteration 345 took 23s (40.97% Gen, 55.01% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 39m 19s. Estimated total time: 19h 47m 23s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 34s, 500 more iterations: 3h 17m 53s.
[2025-11-13 10:11:53,309][__main__][INFO] - Starting iteration 345.
[2025-11-13 10:11:53,312][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 34 and human policies 1.
[2025-11-13 10:11:53,313][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:12:02,710][__main__][INFO] - Number of regex retries in iteration 345: 0
[2025-11-13 10:12:02,710][__main__][INFO] - agents played in iteration 345 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 10:12:03,132][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:12:03,166][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:12:03,199][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:12:03,232][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:12:03,233][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:12:03,233][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:12:03,915][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:12:04,211][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:12:04,539][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:12:04,864][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:12:05,191][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:12:05,517][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:12:05,842][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:12:06,169][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:12:06,493][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:12:06,820][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:12:07,148][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:12:07,473][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:12:07,799][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:12:08,125][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:12:08,453][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:12:08,779][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:12:09,106][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:12:09,438][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:12:09,765][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:12:10,097][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:12:10,424][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:12:10,753][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:12:11,079][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:12:11,406][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:12:11,732][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:12:12,059][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:12:12,385][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:12:12,712][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:12:13,040][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:12:13,368][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:12:13,695][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:12:14,022][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:12:14,350][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:12:15,074][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:12:15,753][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:12:15,755][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:12:15,756][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:12:16,638][__main__][INFO] - Iteration 346 took 23s (40.28% Gen, 55.93% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 17m 53s. Estimated total time: 19h 26m 20s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 52s, 500 more iterations: 3h 14m 23s.
[2025-11-13 10:12:16,640][__main__][INFO] - Starting iteration 346.
[2025-11-13 10:12:16,643][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 34 and human policies 1.
[2025-11-13 10:12:16,643][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:12:25,009][__main__][INFO] - Number of regex retries in iteration 346: 0 [2025-11-13 10:12:25,010][__main__][INFO] - agents played in iteration 346 are Alice_buffer, Alice, Bob, Bob_buffer [2025-11-13 10:12:25,448][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:12:25,482][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:12:25,515][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:12:25,548][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:12:25,548][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:12:25,549][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:12:26,227][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:12:26,523][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:12:26,848][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:12:27,175][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:12:27,502][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:12:27,829][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:12:28,155][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:12:28,481][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:12:28,807][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:12:29,134][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:12:29,461][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:12:29,790][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:12:30,116][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:12:30,441][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:12:30,769][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:12:31,095][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:12:31,422][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:12:31,750][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:12:32,077][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:12:32,404][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:12:32,733][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:12:33,062][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:12:33,389][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:12:33,716][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:12:34,044][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:12:34,371][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:12:34,699][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:12:35,027][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:12:35,354][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:12:35,681][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:12:36,008][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:12:36,335][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:12:36,663][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:12:37,395][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:12:38,090][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:12:38,092][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:12:38,093][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:12:38,966][__main__][INFO] - Iteration 347 took 22s (37.47% Gen, 58.61% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 27m 22s. Estimated total time: 18h 36m 12s. Time estimates for 10 more iterations: 3m 43s, 100 more iterations: 37m 12s, 500 more iterations: 3h 6m 2s.
[2025-11-13 10:12:38,968][__main__][INFO] - Starting iteration 347.
[2025-11-13 10:12:38,971][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 34 and human policies 1.
[2025-11-13 10:12:38,971][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:12:48,559][__main__][INFO] - Number of regex retries in iteration 347: 0
[2025-11-13 10:12:48,559][__main__][INFO] - agents played in iteration 347 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 10:12:48,995][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:12:49,029][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:12:49,063][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:12:49,098][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:12:49,098][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:12:49,099][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:12:49,794][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:12:50,091][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:12:50,418][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:12:50,745][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:12:51,072][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:12:51,399][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:12:51,727][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:12:52,053][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:12:52,381][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:12:52,709][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:12:53,035][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:12:53,363][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:12:53,689][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:12:54,021][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:12:54,347][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:12:54,674][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:12:55,001][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:12:55,328][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:12:55,655][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:12:55,983][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:12:56,309][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:12:56,638][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:12:56,966][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:12:57,293][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:12:57,621][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:12:57,949][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:12:58,275][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:12:58,603][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:12:58,931][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:12:59,259][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:12:59,585][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:12:59,912][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:13:00,239][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:13:00,974][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:13:01,665][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:13:01,666][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:13:01,668][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:13:02,678][__main__][INFO] - Iteration 348 took 23s (40.44% Gen, 55.30% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 36m 10s. Estimated total time: 19h 45m 24s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 30s, 500 more iterations: 3h 17m 34s.
[2025-11-13 10:13:02,680][__main__][INFO] - Starting iteration 348.
[2025-11-13 10:13:02,683][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 34 and human policies 1.
[2025-11-13 10:13:02,684][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:13:11,824][__main__][INFO] - Number of regex retries in iteration 348: 0
[2025-11-13 10:13:11,825][__main__][INFO] - agents played in iteration 348 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 10:13:12,263][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:13:12,297][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:13:12,330][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:13:12,364][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:13:12,365][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:13:12,365][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:13:13,092][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:13:13,389][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:13:13,716][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:13:14,045][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:13:14,376][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:13:14,702][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:13:15,029][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:13:15,356][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:13:15,683][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:13:16,010][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:13:16,340][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:13:16,666][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:13:16,991][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:13:17,318][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:13:17,646][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:13:17,975][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:13:18,302][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:13:18,628][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:13:18,954][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:13:19,282][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:13:19,608][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:13:19,937][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:13:20,264][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:13:20,592][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:13:20,919][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:13:21,246][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:13:21,574][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:13:21,902][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:13:22,229][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:13:22,556][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:13:22,883][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:13:23,210][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:13:23,537][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:13:24,271][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:13:24,962][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:13:24,963][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:13:24,965][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:13:25,964][__main__][INFO] - Iteration 349 took 23s (39.26% Gen, 56.44% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 14m 28s. Estimated total time: 19h 24m 5s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 48s, 500 more iterations: 3h 14m 0s.
[2025-11-13 10:13:25,967][__main__][INFO] - Starting iteration 349.
[2025-11-13 10:13:25,970][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 34 and human policies 1.
[2025-11-13 10:13:25,970][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:13:35,883][__main__][INFO] - Number of regex retries in iteration 349: 0
[2025-11-13 10:13:35,883][__main__][INFO] - agents played in iteration 349 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 10:13:36,312][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:13:36,346][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:13:36,380][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:13:36,414][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:13:36,415][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:13:36,415][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:13:37,144][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:13:37,441][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:13:37,766][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:13:38,095][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:13:38,424][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:13:38,753][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:13:39,081][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:13:39,409][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:13:39,736][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:13:40,065][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:13:40,394][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:13:40,720][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:13:41,048][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:13:41,376][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:13:41,703][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:13:42,030][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:13:42,358][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:13:42,685][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:13:43,014][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:13:43,345][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:13:43,673][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:13:44,002][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:13:44,329][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:13:44,658][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:13:44,990][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:13:45,317][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:13:45,644][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:13:45,972][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:13:46,298][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:13:46,624][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:13:46,953][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:13:47,280][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:13:47,608][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:13:48,342][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:13:49,063][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:13:49,065][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:13:49,066][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:13:49,955][__main__][INFO] - Iteration 350 took 23s (41.33% Gen, 54.96% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 49m 18s. Estimated total time: 19h 59m 18s. Time estimates for 10 more iterations: 3m 59s, 100 more iterations: 39m 58s, 500 more iterations: 3h 19m 53s.
[2025-11-13 10:13:49,957][__main__][INFO] - Starting iteration 350.
[2025-11-13 10:13:49,960][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 34 and human policies 1.
[2025-11-13 10:13:49,960][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:13:59,701][__main__][INFO] - Number of regex retries in iteration 350: 0
[2025-11-13 10:13:59,701][__main__][INFO] - agents played in iteration 350 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 10:14:00,142][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:14:00,177][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:14:00,210][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:14:00,244][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:14:00,244][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:14:00,245][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:14:00,961][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:14:01,259][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:14:01,587][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:14:01,912][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:14:02,239][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:14:02,566][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:14:02,892][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:14:03,222][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:14:03,555][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:14:03,882][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:14:04,210][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:14:04,536][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:14:04,863][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:14:05,189][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:14:05,517][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:14:05,842][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:14:06,174][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:14:06,503][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:14:06,837][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:14:07,169][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:14:07,498][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:14:07,827][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:14:08,155][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:14:08,481][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:14:08,809][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:14:09,135][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:14:09,462][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:14:09,789][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:14:10,116][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:14:10,444][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:14:10,770][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:14:11,097][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:14:11,424][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:14:12,150][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:14:12,853][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:14:12,854][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:14:12,855][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:14:14,747][__main__][INFO] - Iteration 351 took 24s (39.29% Gen, 53.07% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 28m 58s. Estimated total time: 20h 39m 24s. Time estimates for 10 more iterations: 4m 7s, 100 more iterations: 41m 18s, 500 more iterations: 3h 26m 34s.
[2025-11-13 10:14:14,754][__main__][INFO] - Starting iteration 351.
[2025-11-13 10:14:14,757][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 35 and human policies 1.
[2025-11-13 10:14:14,758][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:14:24,677][__main__][INFO] - Number of regex retries in iteration 351: 0
[2025-11-13 10:14:24,678][__main__][INFO] - agents played in iteration 351 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 10:14:25,119][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:14:25,154][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:14:25,187][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:14:25,221][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:14:25,222][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:14:25,223][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:14:25,924][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:14:26,219][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:14:26,545][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:14:26,876][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:14:27,202][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:14:27,528][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:14:27,857][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:14:28,186][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:14:28,512][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:14:28,837][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:14:29,162][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:14:29,495][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:14:29,826][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:14:30,153][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:14:30,480][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:14:30,807][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:14:31,134][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:14:31,461][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:14:31,787][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:14:32,113][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:14:32,442][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:14:32,768][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:14:33,096][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:14:33,423][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:14:33,750][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:14:34,078][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:14:34,404][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:14:34,731][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:14:35,059][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:14:35,386][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:14:35,713][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:14:36,041][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:14:36,368][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:14:37,103][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:14:37,811][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:14:37,813][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:14:37,814][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:14:38,759][__main__][INFO] - Iteration 352 took 24s (41.33% Gen, 54.73% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 49m 17s. Estimated total time: 20h 0m 7s. Time estimates for 10 more iterations: 4m 0s, 100 more iterations: 40m 0s, 500 more iterations: 3h 20m 1s. [2025-11-13 10:14:38,761][__main__][INFO] - Starting iteration 352. [2025-11-13 10:14:38,764][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 35 and human policies 1. 
[2025-11-13 10:14:38,764][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:14:48,212][__main__][INFO] - Number of regex retries in iteration 352: 0
[2025-11-13 10:14:48,213][__main__][INFO] - agents played in iteration 352 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 10:14:48,647][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:14:48,681][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:14:48,714][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:14:48,748][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:14:48,749][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:14:48,749][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:14:49,444][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:14:49,740][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:14:50,066][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:14:50,392][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:14:50,718][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:14:51,044][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:14:51,370][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:14:51,697][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:14:52,023][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:14:52,351][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:14:52,677][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:14:53,004][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:14:53,332][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:14:53,661][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:14:53,989][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:14:54,320][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:14:54,648][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:14:54,976][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:14:55,303][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:14:55,630][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:14:55,957][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:14:56,284][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:14:56,611][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:14:56,937][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:14:57,263][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:14:57,590][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:14:57,917][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:14:58,244][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:14:58,572][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:14:58,898][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:14:59,225][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:14:59,552][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:14:59,879][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:15:00,608][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:15:01,305][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:15:01,306][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:15:01,307][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:15:02,233][__main__][INFO] - Iteration 353 took 23s (40.26% Gen, 55.79% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 22m 18s. Estimated total time: 19h 33m 31s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 7s, 500 more iterations: 3h 15m 35s.
[2025-11-13 10:15:02,235][__main__][INFO] - Starting iteration 353.
[2025-11-13 10:15:02,238][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 35 and human policies 1.
[2025-11-13 10:15:02,238][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:15:12,016][__main__][INFO] - Number of regex retries in iteration 353: 0
[2025-11-13 10:15:12,017][__main__][INFO] - agents played in iteration 353 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 10:15:12,455][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:15:12,489][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:15:12,523][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:15:12,556][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:15:12,557][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:15:12,558][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:15:13,245][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:15:13,542][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:15:13,868][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:15:14,194][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:15:14,520][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:15:14,846][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:15:15,171][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:15:15,499][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:15:15,823][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:15:16,150][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:15:16,477][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:15:16,805][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:15:17,131][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:15:17,460][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:15:17,792][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:15:18,122][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:15:18,451][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:15:18,778][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:15:19,105][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:15:19,432][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:15:19,760][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:15:20,087][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:15:20,415][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:15:20,742][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:15:21,070][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:15:21,397][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:15:21,729][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:15:22,055][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:15:22,382][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:15:22,709][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:15:23,036][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:15:23,365][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:15:23,691][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:15:24,425][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:15:25,118][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:15:25,120][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:15:25,121][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:15:26,093][__main__][INFO] - Iteration 354 took 23s (40.99% Gen, 54.93% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 41m 12s. Estimated total time: 19h 52m 49s. Time estimates for 10 more iterations: 3m 58s, 100 more iterations: 39m 45s, 500 more iterations: 3h 18m 48s.
[2025-11-13 10:15:26,095][__main__][INFO] - Starting iteration 354.
[2025-11-13 10:15:26,098][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 35 and human policies 1.
[2025-11-13 10:15:26,099][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:15:35,261][__main__][INFO] - Number of regex retries in iteration 354: 0
[2025-11-13 10:15:35,261][__main__][INFO] - agents played in iteration 354 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 10:15:35,693][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:15:35,728][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:15:35,762][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:15:35,797][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:15:35,797][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:15:35,798][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:15:36,510][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:15:36,807][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:15:37,134][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:15:37,459][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:15:37,785][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:15:38,111][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:15:38,437][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:15:38,763][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:15:39,090][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:15:39,416][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:15:39,742][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:15:40,074][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:15:40,404][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:15:40,731][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:15:41,060][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:15:41,385][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:15:41,711][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:15:42,039][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:15:42,367][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:15:42,696][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:15:43,022][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:15:43,349][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:15:43,678][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:15:44,004][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:15:44,331][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:15:44,659][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:15:44,986][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:15:45,314][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:15:45,641][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:15:45,968][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:15:46,295][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:15:46,622][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:15:46,950][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:15:47,680][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:15:48,372][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:15:48,373][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:15:48,375][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:15:49,358][__main__][INFO] - Iteration 355 took 23s (39.39% Gen, 56.38% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 11m 3s. Estimated total time: 19h 23m 3s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 46s, 500 more iterations: 3h 13m 50s.
[2025-11-13 10:15:49,360][__main__][INFO] - Starting iteration 355.
[2025-11-13 10:15:49,363][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 35 and human policies 1.
[2025-11-13 10:15:49,363][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:15:58,538][__main__][INFO] - Number of regex retries in iteration 355: 0
[2025-11-13 10:15:58,539][__main__][INFO] - agents played in iteration 355 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 10:15:58,976][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:15:59,009][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:15:59,042][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:15:59,075][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:15:59,076][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:15:59,076][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:15:59,795][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:16:00,090][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:16:00,416][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:16:00,742][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:16:01,068][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:16:01,394][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:16:01,719][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:16:02,045][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:16:02,370][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:16:02,695][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:16:03,024][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:16:03,351][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:16:03,677][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:16:04,002][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:16:04,329][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:16:04,659][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:16:04,988][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:16:05,320][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:16:05,649][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:16:05,975][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:16:06,302][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:16:06,632][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:16:06,959][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:16:07,285][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:16:07,611][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:16:07,937][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:16:08,264][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:16:08,592][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:16:08,919][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:16:09,247][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:16:09,573][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:16:09,901][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:16:10,227][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:16:10,956][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:16:11,672][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:16:11,675][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:16:11,676][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:16:12,797][__main__][INFO] - Iteration 356 took 23s (39.15% Gen, 56.06% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 19m 22s. Estimated total time: 19h 31m 46s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 3s, 500 more iterations: 3h 15m 17s.
[2025-11-13 10:16:12,800][__main__][INFO] - Starting iteration 356.
[2025-11-13 10:16:12,802][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 35 and human policies 1.
[2025-11-13 10:16:12,803][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:16:22,293][__main__][INFO] - Number of regex retries in iteration 356: 0
[2025-11-13 10:16:22,294][__main__][INFO] - agents played in iteration 356 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 10:16:22,723][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:16:23,090][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:16:23,124][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:16:23,158][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:16:23,158][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:16:23,158][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:16:23,867][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:16:24,164][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:16:24,492][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:16:24,819][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:16:25,145][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:16:25,470][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:16:25,797][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:16:26,124][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:16:26,450][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:16:26,783][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:16:27,115][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:16:27,446][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:16:27,773][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:16:28,103][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:16:28,431][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:16:28,764][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:16:29,090][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:16:29,417][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:16:29,747][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:16:30,078][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:16:30,405][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:16:30,731][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:16:31,060][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:16:31,387][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:16:31,714][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:16:32,040][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:16:32,367][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:16:32,694][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:16:33,022][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:16:33,349][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:16:33,675][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:16:34,002][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:16:34,329][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:16:35,057][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:16:35,801][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:16:35,802][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:16:35,804][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:16:36,782][__main__][INFO] - Iteration 357 took 23s (39.57% Gen, 56.34% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 46m 15s. Estimated total time: 19h 59m 3s. Time estimates for 10 more iterations: 3m 59s, 100 more iterations: 39m 58s, 500 more iterations: 3h 19m 50s.
[2025-11-13 10:16:36,784][__main__][INFO] - Starting iteration 357.
[2025-11-13 10:16:36,787][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 35 and human policies 1.
[2025-11-13 10:16:36,787][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:16:45,729][__main__][INFO] - Number of regex retries in iteration 357: 0
[2025-11-13 10:16:45,729][__main__][INFO] - agents played in iteration 357 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 10:16:46,235][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:16:46,268][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:16:46,300][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:16:46,333][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:16:46,334][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:16:46,334][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:16:47,015][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:16:47,310][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:16:47,637][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:16:47,963][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:16:48,290][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:16:48,616][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:16:48,942][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:16:49,268][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:16:49,595][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:16:49,922][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:16:50,252][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:16:50,580][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:16:50,908][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:16:51,235][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:16:51,565][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:16:51,894][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:16:52,222][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:16:52,549][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:16:52,878][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:16:53,206][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:16:53,533][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:16:53,859][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:16:54,186][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:16:54,513][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:16:54,840][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:16:55,166][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:16:55,492][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:16:55,818][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:16:56,145][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:16:56,472][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:16:56,800][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:16:57,128][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:16:57,454][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:16:58,189][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:16:58,901][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:16:58,902][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:16:58,904][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:16:59,781][__main__][INFO] - Iteration 358 took 22s (38.88% Gen, 57.29% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 56m 35s. Estimated total time: 19h 9m 46s. Time estimates for 10 more iterations: 3m 49s, 100 more iterations: 38m 19s, 500 more iterations: 3h 11m 37s.
[2025-11-13 10:16:59,784][__main__][INFO] - Starting iteration 358.
[2025-11-13 10:16:59,787][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 35 and human policies 1.
[2025-11-13 10:16:59,787][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:17:08,936][__main__][INFO] - Number of regex retries in iteration 358: 0
[2025-11-13 10:17:08,937][__main__][INFO] - agents played in iteration 358 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 10:17:09,390][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:17:09,424][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:17:09,459][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:17:09,493][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:17:09,493][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:17:09,494][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:17:10,182][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:17:10,478][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:17:10,806][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:17:11,130][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:17:11,458][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:17:11,784][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:17:12,111][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:17:12,439][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:17:12,767][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:17:13,093][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:17:13,420][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:17:13,746][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:17:14,074][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:17:14,399][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:17:14,727][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:17:15,053][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:17:15,380][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:17:15,712][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:17:16,040][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:17:16,367][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:17:16,696][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:17:17,023][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:17:17,350][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:17:17,677][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:17:18,005][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:17:18,332][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:17:18,661][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:17:18,987][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:17:19,314][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:17:19,641][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:17:19,968][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:17:20,294][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:17:20,622][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:17:21,352][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:17:22,074][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:17:22,076][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:17:22,078][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:17:23,147][__main__][INFO] - Iteration 359 took 23s (39.16% Gen, 56.25% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 14m 30s. Estimated total time: 19h 28m 4s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 56s, 500 more iterations: 3h 14m 40s.
[2025-11-13 10:17:23,149][__main__][INFO] - Starting iteration 359.
[2025-11-13 10:17:23,152][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 35 and human policies 1.
[2025-11-13 10:17:23,153][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:17:32,862][__main__][INFO] - Number of regex retries in iteration 359: 0
[2025-11-13 10:17:32,863][__main__][INFO] - agents played in iteration 359 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 10:17:33,289][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:17:33,322][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:17:33,357][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:17:33,391][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:17:33,392][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:17:33,392][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:17:34,107][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:17:34,402][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:17:34,728][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:17:35,053][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:17:35,380][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:17:35,705][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:17:36,032][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:17:36,359][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:17:36,686][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:17:37,013][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:17:37,342][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:17:37,669][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:17:37,997][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:17:38,325][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:17:38,656][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:17:38,985][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:17:39,314][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:17:39,643][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:17:39,970][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:17:40,298][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:17:40,624][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:17:40,951][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:17:41,278][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:17:41,605][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:17:41,932][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:17:42,261][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:17:42,587][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:17:42,914][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:17:43,242][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:17:43,569][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:17:43,895][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:17:44,222][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:17:44,549][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:17:45,276][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:17:46,009][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:17:46,010][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:17:46,012][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:17:47,071][__main__][INFO] - Iteration 360 took 23s (40.59% Gen, 54.97% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 42m 0s. Estimated total time: 19h 55m 58s. Time estimates for 10 more iterations: 3m 59s, 100 more iterations: 39m 51s, 500 more iterations: 3h 19m 19s.
[2025-11-13 10:17:47,073][__main__][INFO] - Starting iteration 360.
[2025-11-13 10:17:47,076][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 35 and human policies 1.
[2025-11-13 10:17:47,077][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:17:56,389][__main__][INFO] - Number of regex retries in iteration 360: 0
[2025-11-13 10:17:56,390][__main__][INFO] - agents played in iteration 360 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 10:17:56,819][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:17:56,853][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:17:56,886][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:17:56,920][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:17:56,920][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:17:56,921][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:17:57,645][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:17:57,943][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:17:58,271][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:17:58,598][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:17:58,925][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:17:59,252][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:17:59,578][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:17:59,904][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:18:00,229][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:18:00,555][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:18:00,885][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:18:01,212][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:18:01,542][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:18:01,874][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:18:02,204][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:18:02,530][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:18:02,862][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:18:03,194][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:18:03,520][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:18:03,849][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:18:04,177][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:18:04,504][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:18:04,830][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:18:05,157][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:18:05,484][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:18:05,811][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:18:06,138][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:18:06,465][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:18:06,791][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:18:07,119][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:18:07,446][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:18:07,772][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:18:08,099][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:18:08,818][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:18:09,520][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:18:09,521][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:18:09,523][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:18:11,455][__main__][INFO] - Iteration 361 took 24s (38.20% Gen, 53.87% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 4m 38s. Estimated total time: 20h 19m 0s. Time estimates for 10 more iterations: 4m 3s, 100 more iterations: 40m 38s, 500 more iterations: 3h 23m 10s.
[2025-11-13 10:18:11,458][__main__][INFO] - Starting iteration 361.
[2025-11-13 10:18:11,461][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 36 and human policies 1.
[2025-11-13 10:18:11,461][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:18:21,633][__main__][INFO] - Number of regex retries in iteration 361: 0
[2025-11-13 10:18:21,634][__main__][INFO] - agents played in iteration 361 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 10:18:22,086][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:18:22,123][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:18:22,156][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:18:22,189][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:18:22,189][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:18:22,190][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:18:22,881][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:18:23,177][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:18:23,506][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:18:23,834][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:18:24,162][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:18:24,487][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:18:24,815][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:18:25,143][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:18:25,471][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:18:25,797][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:18:26,128][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:18:26,454][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:18:26,786][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:18:27,118][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:18:27,446][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:18:27,773][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:18:28,099][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:18:28,425][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:18:28,753][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:18:29,080][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:18:29,408][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:18:29,735][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:18:30,063][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:18:30,389][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:18:30,716][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:18:31,042][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:18:31,369][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:18:31,696][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:18:32,023][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:18:32,350][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:18:32,676][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:18:33,004][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:18:33,332][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:18:34,050][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:18:34,778][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:18:34,779][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:18:34,781][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:18:35,676][__main__][INFO] - Iteration 362 took 24s (42.01% Gen, 54.29% Train). Generation: 10s, Training: 13s. Estimated remaining time: 17h 56m 2s. Estimated total time: 20h 10m 49s. Time estimates for 10 more iterations: 4m 2s, 100 more iterations: 40m 21s, 500 more iterations: 3h 21m 48s.
[2025-11-13 10:18:35,678][__main__][INFO] - Starting iteration 362.
[2025-11-13 10:18:35,681][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 36 and human policies 1.
[2025-11-13 10:18:35,682][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:18:44,897][__main__][INFO] - Number of regex retries in iteration 362: 0
[2025-11-13 10:18:44,898][__main__][INFO] - agents played in iteration 362 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 10:18:45,330][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:18:45,367][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:18:45,400][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:18:45,433][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:18:45,433][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:18:45,434][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:18:46,123][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:18:46,419][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:18:46,749][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:18:47,080][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:18:47,409][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:18:47,735][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:18:48,064][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:18:48,392][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:18:48,720][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:18:49,049][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:18:49,377][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:18:49,708][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:18:50,040][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:18:50,367][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:18:50,696][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:18:51,024][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:18:51,353][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:18:51,681][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:18:52,007][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:18:52,335][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:18:52,662][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:18:52,990][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:18:53,316][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:18:53,644][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:18:53,972][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:18:54,299][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:18:54,626][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:18:54,953][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:18:55,280][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:18:55,608][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:18:55,935][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:18:56,262][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:18:56,589][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:18:57,331][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:18:58,069][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:18:58,070][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:18:58,072][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:18:59,055][__main__][INFO] - Iteration 363 took 23s (39.43% Gen, 56.36% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 13m 33s. Estimated total time: 19h 28m 43s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 57s, 500 more iterations: 3h 14m 47s.
[2025-11-13 10:18:59,057][__main__][INFO] - Starting iteration 363.
[2025-11-13 10:18:59,059][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 36 and human policies 1.
[2025-11-13 10:18:59,060][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:19:08,490][__main__][INFO] - Number of regex retries in iteration 363: 0
[2025-11-13 10:19:08,491][__main__][INFO] - agents played in iteration 363 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 10:19:08,918][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:19:08,952][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:19:08,985][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:19:09,018][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:19:09,019][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:19:09,019][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:19:09,700][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:19:09,997][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:19:10,323][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:19:10,650][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:19:10,976][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:19:11,305][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:19:11,630][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:19:11,958][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:19:12,285][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:19:12,617][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:19:12,949][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:19:13,281][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:19:13,613][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:19:13,939][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:19:14,270][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:19:14,597][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:19:14,924][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:19:15,253][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:19:15,579][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:19:15,907][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:19:16,234][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:19:16,560][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:19:16,888][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:19:17,214][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:19:17,540][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:19:17,867][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:19:18,194][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:19:18,521][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:19:18,849][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:19:19,175][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:19:19,502][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:19:19,830][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:19:20,158][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:19:20,880][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:19:21,600][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:19:21,601][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:19:21,602][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:19:22,496][__main__][INFO] - Iteration 364 took 23s (40.24% Gen, 55.94% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 16m 17s. Estimated total time: 19h 31m 51s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 3s, 500 more iterations: 3h 15m 18s.
[2025-11-13 10:19:22,498][__main__][INFO] - Starting iteration 364.
[2025-11-13 10:19:22,501][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 36 and human policies 1.
[2025-11-13 10:19:22,501][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:19:31,627][__main__][INFO] - Number of regex retries in iteration 364: 0
[2025-11-13 10:19:31,628][__main__][INFO] - agents played in iteration 364 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 10:19:32,055][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:19:32,091][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:19:32,124][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:19:32,157][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:19:32,158][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:19:32,158][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:19:32,841][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:19:33,136][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:19:33,467][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:19:33,793][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:19:34,120][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:19:34,446][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:19:34,773][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:19:35,101][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:19:35,430][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:19:35,758][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:19:36,089][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:19:36,415][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:19:36,744][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:19:37,071][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:19:37,399][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:19:37,725][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:19:38,052][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:19:38,379][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:19:38,706][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:19:39,034][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:19:39,361][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:19:39,688][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:19:40,016][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:19:40,343][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:19:40,670][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:19:40,998][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:19:41,325][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:19:41,651][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:19:41,978][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:19:42,306][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:19:42,633][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:19:42,960][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:19:43,288][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:19:44,021][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:19:44,736][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:19:44,738][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:19:44,739][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:19:45,624][__main__][INFO] - Iteration 365 took 23s (39.47% Gen, 56.70% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 0m 15s. Estimated total time: 19h 16m 11s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 32s, 500 more iterations: 3h 12m 41s.
[2025-11-13 10:19:45,626][__main__][INFO] - Starting iteration 365.
[2025-11-13 10:19:45,629][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 36 and human policies 1.
[2025-11-13 10:19:45,629][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:19:55,167][__main__][INFO] - Number of regex retries in iteration 365: 0
[2025-11-13 10:19:55,168][__main__][INFO] - agents played in iteration 365 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 10:19:55,594][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:19:55,627][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:19:55,660][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:19:55,693][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:19:55,694][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:19:55,695][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:19:56,378][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:19:56,674][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:19:57,001][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:19:57,331][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:19:57,662][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:19:57,990][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:19:58,319][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:19:58,648][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:19:58,978][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:19:59,305][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:19:59,633][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:19:59,964][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:20:00,291][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:20:00,619][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:20:00,949][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:20:01,275][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:20:01,602][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:20:01,929][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:20:02,256][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:20:02,582][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:20:02,909][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:20:03,235][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:20:03,562][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:20:03,888][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:20:04,214][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:20:04,542][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:20:04,869][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:20:05,197][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:20:05,524][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:20:05,854][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:20:06,185][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:20:06,517][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:20:06,849][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:20:07,577][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:20:08,308][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:20:08,309][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:20:08,311][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:20:09,211][__main__][INFO] - Iteration 366 took 23s (40.44% Gen, 55.73% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 22m 49s. Estimated total time: 19h 39m 9s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 18s, 500 more iterations: 3h 16m 31s.
[2025-11-13 10:20:09,213][__main__][INFO] - Starting iteration 366.
[2025-11-13 10:20:09,216][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 36 and human policies 1.
[2025-11-13 10:20:09,217][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:20:18,547][__main__][INFO] - Number of regex retries in iteration 366: 0
[2025-11-13 10:20:18,548][__main__][INFO] - agents played in iteration 366 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 10:20:19,001][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:20:19,047][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:20:19,083][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:20:19,118][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:20:19,118][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:20:19,119][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:20:19,823][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:20:20,118][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:20:20,444][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:20:20,771][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:20:21,098][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:20:21,424][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:20:21,751][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:20:22,079][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:20:22,406][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:20:22,733][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:20:23,061][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:20:23,390][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:20:23,718][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:20:24,045][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:20:24,372][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:20:24,699][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:20:25,025][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:20:25,352][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:20:25,678][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:20:26,005][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:20:26,332][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:20:26,661][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:20:26,988][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:20:27,315][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:20:27,641][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:20:27,968][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:20:28,294][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:20:28,620][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:20:28,947][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:20:29,273][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:20:29,599][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:20:29,926][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:20:30,258][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:20:30,983][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:20:31,708][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:20:31,709][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:20:31,711][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:20:32,821][__main__][INFO] - Iteration 367 took 23s (39.53% Gen, 55.76% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 23m 32s. Estimated total time: 19h 40m 15s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 20s, 500 more iterations: 3h 16m 42s.
[2025-11-13 10:20:32,823][__main__][INFO] - Starting iteration 367.
[2025-11-13 10:20:32,825][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 36 and human policies 1.
[2025-11-13 10:20:32,826][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:20:42,018][__main__][INFO] - Number of regex retries in iteration 367: 0
[2025-11-13 10:20:42,019][__main__][INFO] - agents played in iteration 367 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 10:20:42,445][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:20:42,481][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:20:42,515][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:20:42,550][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:20:42,550][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:20:42,551][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:20:43,278][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:20:43,575][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:20:43,903][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:20:44,233][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:20:44,561][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:20:44,889][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:20:45,221][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:20:45,551][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:20:45,879][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:20:46,208][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:20:46,539][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:20:46,864][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:20:47,193][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:20:47,521][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:20:47,848][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:20:48,173][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:20:48,501][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:20:48,827][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:20:49,154][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:20:49,481][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:20:49,809][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:20:50,135][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:20:50,463][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:20:50,790][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:20:51,117][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:20:51,444][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:20:51,771][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:20:52,099][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:20:52,427][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:20:52,753][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:20:53,079][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:20:53,407][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:20:53,735][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:20:54,447][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:20:55,180][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:20:55,182][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:20:55,183][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:20:56,202][__main__][INFO] - Iteration 368 took 23s (39.32% Gen, 56.31% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 11m 45s. Estimated total time: 19h 28m 52s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 57s, 500 more iterations: 3h 14m 48s.
[2025-11-13 10:20:56,208][__main__][INFO] - Starting iteration 368.
[2025-11-13 10:20:56,211][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 36 and human policies 1.
[2025-11-13 10:20:56,211][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:21:05,909][__main__][INFO] - Number of regex retries in iteration 368: 0
[2025-11-13 10:21:05,910][__main__][INFO] - agents played in iteration 368 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 10:21:06,349][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:21:06,383][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:21:06,417][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:21:06,451][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:21:06,452][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:21:06,452][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:21:07,194][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:21:07,492][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:21:07,823][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:21:08,154][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:21:08,483][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:21:08,812][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:21:09,144][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:21:09,472][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:21:09,800][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:21:10,128][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:21:10,455][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:21:10,782][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:21:11,111][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:21:11,440][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:21:11,767][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:21:12,095][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:21:12,421][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:21:12,748][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:21:13,075][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:21:13,402][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:21:13,729][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 10:21:14,056][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:21:14,384][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:21:14,711][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:21:15,038][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:21:15,365][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:21:15,692][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:21:16,019][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:21:16,346][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:21:16,674][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:21:17,001][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:21:17,329][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:21:17,656][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 10:21:18,395][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:21:19,123][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:21:19,125][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:21:19,126][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:21:20,162][__main__][INFO] - Iteration 369 took 23s (40.49% Gen, 55.18% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 40m 7s. Estimated total time: 19h 57m 38s. Time estimates for 10 more iterations: 3m 59s, 100 more iterations: 39m 55s, 500 more iterations: 3h 19m 36s.
[2025-11-13 10:21:20,164][__main__][INFO] - Starting iteration 369.
[2025-11-13 10:21:20,167][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 36 and human policies 1.
[2025-11-13 10:21:20,168][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:21:29,563][__main__][INFO] - Number of regex retries in iteration 369: 0
[2025-11-13 10:21:29,563][__main__][INFO] - agents played in iteration 369 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 10:21:29,991][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:21:30,024][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:21:30,057][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:21:30,090][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:21:30,090][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:21:30,091][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:21:30,825][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:21:31,122][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:21:31,449][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:21:31,776][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:21:32,104][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:21:32,432][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:21:32,760][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:21:33,088][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:21:33,418][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:21:33,748][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:21:34,077][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:21:34,405][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:21:34,731][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:21:35,058][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:21:35,385][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:21:35,712][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:21:36,039][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:21:36,365][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:21:36,692][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:21:37,018][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:21:37,345][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:21:37,672][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:21:37,999][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:21:38,327][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:21:38,654][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:21:38,981][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:21:39,308][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:21:39,635][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:21:39,963][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:21:40,288][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:21:40,615][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:21:40,943][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:21:41,275][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:21:42,011][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:21:42,744][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:21:42,745][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:21:42,772][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:21:43,901][__main__][INFO] - Iteration 370 took 23s (39.59% Gen, 55.65% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 28m 49s. Estimated total time: 19h 46m 44s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 33s, 500 more iterations: 3h 17m 47s.
[2025-11-13 10:21:43,903][__main__][INFO] - Starting iteration 370.
[2025-11-13 10:21:43,906][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 36 and human policies 1.
[2025-11-13 10:21:43,906][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:21:53,127][__main__][INFO] - Number of regex retries in iteration 370: 0
[2025-11-13 10:21:53,127][__main__][INFO] - agents played in iteration 370 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 10:21:53,565][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:21:53,601][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:21:53,635][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:21:53,669][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:21:53,670][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:21:53,670][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:21:54,370][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:21:54,666][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:21:54,992][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:21:55,319][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:21:55,645][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:21:55,973][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:21:56,305][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:21:56,634][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:21:56,965][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:21:57,294][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:21:57,620][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:21:57,949][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:21:58,279][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:21:58,606][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:21:58,933][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:21:59,260][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:21:59,586][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:21:59,913][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:22:00,240][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:22:00,567][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:22:00,893][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:22:01,219][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:22:01,546][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:22:01,876][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:22:02,202][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:22:02,529][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:22:02,856][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:22:03,183][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:22:03,512][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:22:03,838][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:22:04,164][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:22:04,492][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:22:04,821][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:22:05,553][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:22:06,278][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:22:06,279][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:22:06,281][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:22:08,065][__main__][INFO] - Iteration 371 took 24s (38.16% Gen, 54.45% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 49m 41s. Estimated total time: 20h 8m 0s. Time estimates for 10 more iterations: 4m 1s, 100 more iterations: 40m 16s, 500 more iterations: 3h 21m 20s.
[2025-11-13 10:22:08,067][__main__][INFO] - Starting iteration 371.
[2025-11-13 10:22:08,070][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 37 and human policies 1.
[2025-11-13 10:22:08,070][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:22:16,754][__main__][INFO] - Number of regex retries in iteration 371: 0
[2025-11-13 10:22:16,754][__main__][INFO] - agents played in iteration 371 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 10:22:17,193][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:22:17,227][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:22:17,260][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:22:17,294][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:22:17,295][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:22:17,295][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:22:17,992][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:22:18,288][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:22:18,616][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:22:18,945][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:22:19,271][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:22:19,600][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:22:19,927][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:22:20,254][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:22:20,582][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:22:20,912][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:22:21,241][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:22:21,570][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:22:21,897][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:22:22,223][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:22:22,551][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:22:22,877][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:22:23,204][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:22:23,531][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:22:23,857][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:22:24,183][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:22:24,511][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:22:24,838][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:22:25,165][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:22:25,491][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:22:25,818][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:22:26,144][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:22:26,471][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:22:26,797][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:22:27,123][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:22:27,451][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:22:27,777][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:22:28,103][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:22:28,431][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:22:29,240][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:22:29,958][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:22:29,959][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:22:29,961][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:22:30,962][__main__][INFO] - Iteration 372 took 22s (37.93% Gen, 57.69% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 45m 58s. Estimated total time: 19h 4m 39s. Time estimates for 10 more iterations: 3m 48s, 100 more iterations: 38m 9s, 500 more iterations: 3h 10m 46s.
[2025-11-13 10:22:30,964][__main__][INFO] - Starting iteration 372.
[2025-11-13 10:22:30,967][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 37 and human policies 1.
[2025-11-13 10:22:30,968][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:22:41,010][__main__][INFO] - Number of regex retries in iteration 372: 0
[2025-11-13 10:22:41,011][__main__][INFO] - agents played in iteration 372 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 10:22:41,434][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:22:41,467][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:22:41,499][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:22:41,533][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:22:41,533][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:22:41,533][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:22:42,240][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:22:42,537][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:22:42,866][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:22:43,195][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:22:43,520][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:22:43,850][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:22:44,180][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:22:44,510][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:22:44,835][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:22:45,162][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:22:45,489][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:22:45,817][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:22:46,145][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:22:46,471][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:22:46,797][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:22:47,123][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:22:47,449][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:22:47,775][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:22:48,101][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:22:48,428][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:22:48,755][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:22:49,082][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:22:49,409][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:22:49,735][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:22:50,062][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:22:50,390][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:22:50,718][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:22:51,046][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:22:51,372][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:22:51,703][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:22:52,032][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:22:52,361][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:22:52,694][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:22:53,430][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:22:54,149][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:22:54,151][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:22:54,152][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:22:55,060][__main__][INFO] - Iteration 373 took 24s (41.68% Gen, 54.54% Train). Generation: 10s, Training: 13s. Estimated remaining time: 17h 45m 37s. Estimated total time: 20h 4m 42s. Time estimates for 10 more iterations: 4m 0s, 100 more iterations: 40m 9s, 500 more iterations: 3h 20m 47s.
[2025-11-13 10:22:55,062][__main__][INFO] - Starting iteration 373.
[2025-11-13 10:22:55,065][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 37 and human policies 1.
[2025-11-13 10:22:55,066][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:23:04,468][__main__][INFO] - Number of regex retries in iteration 373: 0
[2025-11-13 10:23:04,469][__main__][INFO] - agents played in iteration 373 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 10:23:04,910][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:23:04,944][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:23:04,978][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:23:05,012][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:23:05,013][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:23:05,013][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:23:05,726][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:23:06,023][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:23:06,354][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:23:06,686][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:23:07,013][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:23:07,340][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:23:07,667][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:23:07,994][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:23:08,321][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:23:08,648][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:23:08,976][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:23:09,302][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:23:09,629][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:23:09,955][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:23:10,281][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:23:10,608][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:23:10,934][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:23:11,261][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:23:11,588][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:23:11,914][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:23:12,241][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:23:12,569][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:23:12,897][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:23:13,222][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:23:13,548][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:23:13,875][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:23:14,204][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:23:14,530][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:23:14,859][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:23:15,190][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:23:15,515][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:23:15,848][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:23:16,180][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:23:16,915][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:23:17,638][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:23:17,640][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:23:17,642][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:23:18,579][__main__][INFO] - Iteration 374 took 23s (39.99% Gen, 56.02% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 16m 14s. Estimated total time: 19h 35m 43s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 11s, 500 more iterations: 3h 15m 57s.
[2025-11-13 10:23:18,581][__main__][INFO] - Starting iteration 374.
[2025-11-13 10:23:18,584][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 37 and human policies 1.
[2025-11-13 10:23:18,585][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:23:27,516][__main__][INFO] - Number of regex retries in iteration 374: 0
[2025-11-13 10:23:27,516][__main__][INFO] - agents played in iteration 374 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 10:23:27,948][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:23:27,985][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:23:28,020][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:23:28,055][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:23:28,055][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:23:28,056][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:23:28,784][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:23:29,180][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:23:29,510][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:23:29,836][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:23:30,162][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:23:30,490][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:23:30,816][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:23:31,143][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:23:31,469][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:23:31,796][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:23:32,123][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:23:32,450][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:23:32,778][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:23:33,104][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:23:33,431][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:23:33,758][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:23:34,085][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:23:34,412][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:23:34,738][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:23:35,066][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:23:35,394][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:23:35,721][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:23:36,047][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:23:36,375][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:23:36,702][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:23:37,029][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:23:37,357][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:23:37,685][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:23:38,011][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:23:38,341][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:23:38,669][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:23:38,996][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:23:39,325][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:23:40,050][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:23:40,781][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:23:40,783][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:23:40,785][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:23:41,797][__main__][INFO] - Iteration 375 took 23s (38.47% Gen, 57.16% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 0m 50s. Estimated total time: 19h 20m 43s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 41s, 500 more iterations: 3h 13m 27s.
[2025-11-13 10:23:41,800][__main__][INFO] - Starting iteration 375.
[2025-11-13 10:23:41,803][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 37 and human policies 1.
[2025-11-13 10:23:41,803][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:23:51,082][__main__][INFO] - Number of regex retries in iteration 375: 0
[2025-11-13 10:23:51,083][__main__][INFO] - agents played in iteration 375 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 10:23:51,526][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:23:51,560][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:23:51,594][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:23:51,628][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:23:51,629][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:23:51,629][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:23:52,366][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:23:52,663][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:23:52,991][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:23:53,319][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:23:53,649][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:23:53,976][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:23:54,308][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:23:54,635][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:23:54,963][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:23:55,289][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:23:55,617][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:23:55,943][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:23:56,269][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:23:56,594][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:23:56,921][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:23:57,248][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:23:57,574][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:23:57,900][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:23:58,226][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:23:58,553][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:23:58,879][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:23:59,205][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:23:59,533][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:23:59,859][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:24:00,187][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:24:00,514][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:24:00,843][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:24:01,169][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:24:01,496][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:24:01,822][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:24:02,149][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:24:02,477][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:24:02,806][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:24:03,539][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:24:04,278][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:24:04,280][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:24:04,282][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:24:05,255][__main__][INFO] - Iteration 376 took 23s (39.57% Gen, 56.28% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 12m 21s. Estimated total time: 19h 32m 37s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 5s, 500 more iterations: 3h 15m 26s.
[2025-11-13 10:24:05,257][__main__][INFO] - Starting iteration 376.
[2025-11-13 10:24:05,259][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 37 and human policies 1.
[2025-11-13 10:24:05,260][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:24:14,047][__main__][INFO] - Number of regex retries in iteration 376: 0
[2025-11-13 10:24:14,048][__main__][INFO] - agents played in iteration 376 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 10:24:14,489][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:24:14,527][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:24:14,561][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:24:14,595][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:24:14,596][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:24:14,596][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:24:15,323][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:24:15,620][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:24:15,947][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:24:16,277][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:24:16,606][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:24:16,933][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:24:17,259][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:24:17,587][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:24:17,916][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:24:18,246][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:24:18,577][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:24:18,904][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:24:19,231][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:24:19,557][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:24:19,883][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:24:20,210][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:24:20,538][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:24:20,865][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:24:21,191][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:24:21,518][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:24:21,845][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:24:22,171][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:24:22,497][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:24:22,825][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:24:23,153][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:24:23,481][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:24:23,808][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:24:24,136][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:24:24,466][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:24:24,796][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:24:25,127][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:24:25,459][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:24:25,791][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:24:26,520][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:24:27,248][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:24:27,250][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:24:27,251][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:24:28,225][__main__][INFO] - Iteration 377 took 22s (38.26% Gen, 57.49% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 47m 40s. Estimated total time: 19h 8m 19s. Time estimates for 10 more iterations: 3m 49s, 100 more iterations: 38m 16s, 500 more iterations: 3h 11m 23s.
[2025-11-13 10:24:28,227][__main__][INFO] - Starting iteration 377.
[2025-11-13 10:24:28,230][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 37 and human policies 1.
[2025-11-13 10:24:28,231][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:24:37,328][__main__][INFO] - Number of regex retries in iteration 377: 0
[2025-11-13 10:24:37,329][__main__][INFO] - agents played in iteration 377 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 10:24:37,769][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:24:37,805][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:24:37,839][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:24:37,873][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:24:37,874][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:24:37,874][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:24:38,623][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:24:38,921][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:24:39,249][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:24:39,577][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:24:39,909][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:24:40,239][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:24:40,573][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:24:40,899][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:24:41,226][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:24:41,554][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:24:41,882][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:24:42,213][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:24:42,544][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:24:42,872][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:24:43,200][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:24:43,527][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:24:43,854][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:24:44,181][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:24:44,508][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:24:44,834][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:24:45,162][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:24:45,490][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:24:45,818][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:24:46,145][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:24:46,472][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:24:46,800][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:24:47,128][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:24:47,455][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:24:47,781][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:24:48,108][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:24:48,436][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:24:48,766][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:24:49,095][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:24:49,828][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:24:50,559][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:24:50,560][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:24:50,562][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:24:51,672][__main__][INFO] - Iteration 378 took 23s (38.81% Gen, 56.45% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 11m 6s. Estimated total time: 19h 32m 8s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 4s, 500 more iterations: 3h 15m 21s.
[2025-11-13 10:24:51,674][__main__][INFO] - Starting iteration 378.
[2025-11-13 10:24:51,677][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 37 and human policies 1.
[2025-11-13 10:24:51,677][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:25:00,423][__main__][INFO] - Number of regex retries in iteration 378: 0
[2025-11-13 10:25:00,423][__main__][INFO] - agents played in iteration 378 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 10:25:00,862][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:25:01,239][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:25:01,273][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:25:01,307][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:25:01,308][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:25:01,308][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:25:02,058][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:25:02,355][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:25:02,683][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:25:03,015][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:25:03,346][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:25:03,674][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:25:04,003][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:25:04,335][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:25:04,664][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:25:04,992][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:25:05,319][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:25:05,646][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:25:05,973][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:25:06,299][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:25:06,626][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:25:06,952][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:25:07,278][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:25:07,607][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:25:07,933][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:25:08,259][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:25:08,586][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:25:08,913][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:25:09,239][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:25:09,567][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:25:09,894][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:25:10,220][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:25:10,547][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:25:10,874][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:25:11,201][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:25:11,527][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:25:11,852][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:25:12,179][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:25:12,506][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:25:13,231][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:25:13,969][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:25:13,970][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:25:13,972][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:25:14,948][__main__][INFO] - Iteration 379 took 23s (37.58% Gen, 58.22% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 2m 9s. Estimated total time: 19h 23m 35s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 47s, 500 more iterations: 3h 13m 55s.
[2025-11-13 10:25:14,950][__main__][INFO] - Starting iteration 379.
[2025-11-13 10:25:14,953][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 37 and human policies 1.
[2025-11-13 10:25:14,953][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:25:23,908][__main__][INFO] - Number of regex retries in iteration 379: 0
[2025-11-13 10:25:23,908][__main__][INFO] - agents played in iteration 379 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 10:25:24,343][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:25:24,378][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:25:24,412][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:25:24,446][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:25:24,447][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:25:24,447][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
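The recurring "For task: …, ΔVRAM % (total): …, Current % of VRAM taken: …, Block Peak % of device VRAM: …, ΔTime: …" records read like the output of a context manager that snapshots GPU memory and wall time around a task. Below is a stdlib-only sketch of such a wrapper; the `memory_used` probe is injected so the example stays runnable, and the suggestion that it would wrap `torch.cuda.memory_allocated()` / `torch.cuda.max_memory_allocated()` in the real trainer is an assumption, not the mllm internals.

```python
import time
from contextlib import contextmanager

@contextmanager
def vram_block(task: str, memory_used, device_total: int, log=print):
    """Log ΔVRAM, current VRAM, block peak, and elapsed time for a task.

    `memory_used()` should return (current_bytes, block_peak_bytes); on CUDA
    this would plausibly come from torch.cuda.memory_allocated() and
    torch.cuda.max_memory_allocated() (an assumption for this sketch).
    """
    start_mem, _ = memory_used()
    start = time.monotonic()
    yield
    cur, peak = memory_used()
    elapsed = int(time.monotonic() - start)
    log(
        f"For task: {task}, "
        f"ΔVRAM % (total): {100 * (cur - start_mem) / device_total:.2f}%, "
        f"Current % of VRAM taken: {100 * cur / device_total:.2f}%, "
        f"Block Peak % of device VRAM: {100 * peak / device_total:.2f}%, "
        f"ΔTime: {elapsed // 3600:02d}:{elapsed % 3600 // 60:02d}:{elapsed % 60:02d}"
    )
```

With a probe that returns a constant footprint, the wrapper reproduces the "ΔVRAM 0.00%" pattern seen in the repeated advantage-computation records.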
[2025-11-13 10:25:25,188][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:25:25,485][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:25:25,814][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:25:26,142][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:25:26,472][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:25:26,801][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:25:27,130][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:25:27,462][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:25:27,792][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:25:28,121][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:25:28,450][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:25:28,777][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:25:29,107][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:25:29,433][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:25:29,760][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:25:30,086][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:25:30,412][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:25:30,739][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:25:31,067][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:25:31,394][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:25:31,722][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:25:32,049][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:25:32,374][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:25:32,701][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:25:33,028][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:25:33,356][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:25:33,682][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:25:34,010][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:25:34,338][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:25:34,664][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:25:34,993][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:25:35,320][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:25:35,648][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:25:36,365][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:25:37,105][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:25:37,106][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:25:37,108][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:25:38,078][__main__][INFO] - Iteration 380 took 23s (38.72% Gen, 57.08% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 54m 28s. Estimated total time: 19h 16m 17s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 32s, 500 more iterations: 3h 12m 42s.
[2025-11-13 10:25:38,080][__main__][INFO] - Starting iteration 380.
[2025-11-13 10:25:38,083][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 37 and human policies 1.
[2025-11-13 10:25:38,084][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:25:47,926][__main__][INFO] - Number of regex retries in iteration 380: 0
[2025-11-13 10:25:47,927][__main__][INFO] - agents played in iteration 380 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 10:25:48,352][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:25:48,385][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:25:48,417][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:25:48,451][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:25:48,451][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:25:48,452][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:25:49,169][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:25:49,466][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:25:49,797][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:25:50,128][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:25:50,458][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:25:50,785][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:25:51,114][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:25:51,444][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:25:51,771][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:25:52,103][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:25:52,434][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:25:52,767][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:25:53,101][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:25:53,429][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:25:53,756][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:25:54,082][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:25:54,411][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:25:54,737][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:25:55,063][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:25:55,390][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:25:55,718][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:25:56,046][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:25:56,373][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:25:56,700][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:25:57,027][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:25:57,354][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:25:57,681][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:25:58,008][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:25:58,335][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:25:58,664][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:25:58,991][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:25:59,317][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:25:59,643][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:26:00,343][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:26:01,083][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:26:01,085][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:26:01,087][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:26:02,981][__main__][INFO] - Iteration 381 took 24s (39.53% Gen, 52.86% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 22m 41s. Estimated total time: 20h 44m 55s. Time estimates for 10 more iterations: 4m 8s, 100 more iterations: 41m 29s, 500 more iterations: 3h 27m 29s.
[2025-11-13 10:26:02,983][__main__][INFO] - Starting iteration 381.
[2025-11-13 10:26:02,986][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 38 and human policies 1.
[2025-11-13 10:26:02,987][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:26:12,541][__main__][INFO] - Number of regex retries in iteration 381: 0
[2025-11-13 10:26:12,542][__main__][INFO] - agents played in iteration 381 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 10:26:12,973][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:26:13,346][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:26:13,379][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:26:13,413][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:26:13,414][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:26:13,414][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:26:14,102][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:26:14,398][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:26:14,726][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:26:15,055][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:26:15,384][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:26:15,710][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:26:16,037][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:26:16,369][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:26:16,697][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:26:17,025][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:26:17,353][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:26:17,679][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:26:18,005][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:26:18,332][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:26:18,659][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:26:18,986][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:26:19,313][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:26:19,641][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:26:19,968][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:26:20,295][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:26:20,622][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:26:20,949][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:26:21,275][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:26:21,603][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:26:21,930][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:26:22,257][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:26:22,584][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:26:22,910][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:26:23,237][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:26:23,563][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:26:23,890][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:26:24,218][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:26:24,550][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:26:25,277][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:26:26,024][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:26:26,025][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:26:26,026][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:26:26,975][__main__][INFO] - Iteration 382 took 23s (39.83% Gen, 56.21% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 36m 51s. Estimated total time: 19h 59m 29s. Time estimates for 10 more iterations: 3m 59s, 100 more iterations: 39m 58s, 500 more iterations: 3h 19m 54s.
[2025-11-13 10:26:26,977][__main__][INFO] - Starting iteration 382.
[2025-11-13 10:26:26,980][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 38 and human policies 1.
[2025-11-13 10:26:26,981][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:26:36,303][__main__][INFO] - Number of regex retries in iteration 382: 0
[2025-11-13 10:26:36,303][__main__][INFO] - agents played in iteration 382 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 10:26:36,746][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:26:36,781][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:26:36,815][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:26:36,848][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:26:36,849][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:26:36,849][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:26:37,579][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:26:37,876][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:26:38,204][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:26:38,529][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:26:38,857][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:26:39,185][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:26:39,513][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:26:39,840][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:26:40,168][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:26:40,496][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:26:40,824][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:26:41,151][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:26:41,478][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:26:41,805][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:26:42,132][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:26:42,459][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:26:42,787][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:26:43,114][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:26:43,442][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:26:43,769][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:26:44,095][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:26:44,423][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:26:44,749][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:26:45,075][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:26:45,402][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:26:45,729][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:26:46,056][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:26:46,383][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:26:46,710][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:26:47,036][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:26:47,362][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:26:47,689][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:26:48,017][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:26:48,753][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:26:49,501][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:26:49,502][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:26:49,504][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:26:50,530][__main__][INFO] - Iteration 383 took 23s (39.58% Gen, 56.05% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 14m 31s. Estimated total time: 19h 37m 32s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 15s, 500 more iterations: 3h 16m 15s.
[2025-11-13 10:26:50,533][__main__][INFO] - Starting iteration 383.
[2025-11-13 10:26:50,536][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 38 and human policies 1.
[2025-11-13 10:26:50,536][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:26:59,928][__main__][INFO] - Number of regex retries in iteration 383: 0
[2025-11-13 10:26:59,929][__main__][INFO] - agents played in iteration 383 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 10:27:00,376][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:27:00,413][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:27:00,446][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:27:00,480][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:27:00,480][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:27:00,481][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:27:01,487][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:27:01,784][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:27:02,111][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:27:02,437][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:27:02,766][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:27:03,098][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:27:03,425][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:27:03,758][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:27:04,088][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:27:04,417][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:27:04,746][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:27:05,072][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:27:05,400][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:27:05,726][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:27:06,053][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:27:06,379][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:27:06,707][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:27:07,034][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:27:07,361][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:27:07,688][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:27:08,015][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:27:08,341][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:27:08,668][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:27:08,994][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:27:09,321][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:27:09,650][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:27:09,975][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:27:10,302][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:27:10,630][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:27:10,958][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:27:11,286][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:27:11,615][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:27:11,941][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:27:12,657][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:27:13,396][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:27:13,398][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:27:13,399][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:27:14,322][__main__][INFO] - Iteration 384 took 23s (39.49% Gen, 56.63% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 25m 54s. Estimated total time: 19h 49m 19s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 38s, 500 more iterations: 3h 18m 13s.
[2025-11-13 10:27:14,323][__main__][INFO] - Starting iteration 384.
[2025-11-13 10:27:14,326][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 38 and human policies 1.
[2025-11-13 10:27:14,326][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:27:23,508][__main__][INFO] - Number of regex retries in iteration 384: 0
[2025-11-13 10:27:23,509][__main__][INFO] - agents played in iteration 384 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 10:27:23,956][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:27:23,993][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:27:24,026][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:27:24,059][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:27:24,059][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:27:24,059][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:27:25,060][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:27:25,357][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:27:25,685][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:27:26,013][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:27:26,340][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:27:26,671][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:27:27,003][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:27:27,333][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:27:27,663][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:27:27,993][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:27:28,320][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:27:28,647][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:27:28,973][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:27:29,300][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:27:29,628][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:27:29,954][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:27:30,281][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:27:30,608][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:27:30,935][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:27:31,263][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:27:31,590][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:27:31,917][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:27:32,245][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:27:32,572][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:27:32,899][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:27:33,226][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:27:33,553][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:27:33,879][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:27:34,207][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:27:34,535][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:27:34,868][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:27:35,194][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:27:35,522][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:27:36,342][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:27:37,067][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:27:37,069][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:27:37,070][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:27:37,947][__main__][INFO] - Iteration 385 took 23s (38.87% Gen, 57.41% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 17m 16s. Estimated total time: 19h 41m 5s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 22s, 500 more iterations: 3h 16m 50s. [2025-11-13 10:27:37,949][__main__][INFO] - Starting iteration 385. [2025-11-13 10:27:37,952][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 38 and human policies 1. 
[2025-11-13 10:27:37,953][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:27:46,675][__main__][INFO] - Number of regex retries in iteration 385: 0 [2025-11-13 10:27:46,675][__main__][INFO] - agents played in iteration 385 are Alice_buffer, Alice, Bob, Bob_buffer [2025-11-13 10:27:47,118][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:27:47,152][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:27:47,185][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:27:47,218][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:27:47,219][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:27:47,219][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:27:47,935][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:27:48,232][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:27:48,560][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:27:48,886][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:27:49,212][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:27:49,541][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:27:49,872][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:27:50,200][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:27:50,532][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:27:50,860][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:27:51,190][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:27:51,516][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:27:51,843][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:27:52,170][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:27:52,498][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:27:52,825][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:27:53,151][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:27:53,478][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:27:53,805][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:27:54,132][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 
[2025-11-13 10:27:54,458][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:27:54,785][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:27:55,112][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:27:55,438][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:27:55,765][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:27:56,093][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:27:56,419][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:27:56,747][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:27:57,075][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:27:57,404][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:27:57,732][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:27:58,060][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:27:58,387][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 10:27:59,122][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:27:59,862][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:27:59,864][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:27:59,866][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:28:00,804][__main__][INFO] - Iteration 386 took 22s (38.17% Gen, 57.72% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 38m 27s. Estimated total time: 19h 2m 39s. Time estimates for 10 more iterations: 3m 48s, 100 more iterations: 38m 5s, 500 more iterations: 3h 10m 26s. [2025-11-13 10:28:00,806][__main__][INFO] - Starting iteration 386. [2025-11-13 10:28:00,809][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 38 and human policies 1. 
[2025-11-13 10:28:00,809][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:28:09,598][__main__][INFO] - Number of regex retries in iteration 386: 0 [2025-11-13 10:28:09,599][__main__][INFO] - agents played in iteration 386 are Alice_buffer, Alice, Bob, Bob_buffer [2025-11-13 10:28:10,038][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:28:10,071][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:28:10,104][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:28:10,137][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:28:10,137][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:28:10,138][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:28:11,145][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:28:11,442][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:28:11,770][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:28:12,097][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:28:12,424][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:28:12,750][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:28:13,077][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:28:13,405][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:28:13,732][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:28:14,060][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:28:14,387][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:28:14,714][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:28:15,042][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:28:15,368][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:28:15,695][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:28:16,021][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:28:16,349][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:28:16,676][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:28:17,003][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:28:17,331][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 
[2025-11-13 10:28:17,656][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:28:17,983][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:28:18,310][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:28:18,636][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:28:18,964][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:28:19,291][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:28:19,619][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:28:19,946][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:28:20,275][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:28:20,602][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:28:20,929][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:28:21,260][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:28:21,593][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 10:28:22,325][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:28:23,055][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:28:23,056][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:28:23,058][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:28:24,046][__main__][INFO] - Iteration 387 took 23s (37.83% Gen, 57.92% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 57m 17s. Estimated total time: 19h 21m 52s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 43s, 500 more iterations: 3h 13m 38s. [2025-11-13 10:28:24,048][__main__][INFO] - Starting iteration 387. [2025-11-13 10:28:24,050][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 38 and human policies 1. 
[2025-11-13 10:28:24,051][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:28:33,224][__main__][INFO] - Number of regex retries in iteration 387: 0 [2025-11-13 10:28:33,225][__main__][INFO] - agents played in iteration 387 are Alice_buffer, Alice, Bob, Bob_buffer [2025-11-13 10:28:33,649][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:28:33,682][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:28:33,715][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:28:33,749][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:28:33,749][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:28:33,749][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:28:34,437][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:28:34,733][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:28:35,063][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:28:35,395][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:28:35,723][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:28:36,052][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:28:36,385][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:28:36,718][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:28:37,050][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:28:37,382][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:28:37,713][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:28:38,040][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:28:38,367][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:28:38,693][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:28:39,020][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:28:39,347][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:28:39,675][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:28:40,002][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:28:40,328][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:28:40,655][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 
[2025-11-13 10:28:40,981][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:28:41,308][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:28:41,637][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:28:41,964][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:28:42,291][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:28:42,618][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:28:42,946][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:28:43,274][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:28:43,601][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:28:43,928][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:28:44,256][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:28:44,583][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:28:44,915][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 10:28:45,651][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:28:46,374][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:28:46,376][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:28:46,378][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:28:47,302][__main__][INFO] - Iteration 388 took 23s (39.45% Gen, 56.57% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 57m 38s. Estimated total time: 19h 22m 36s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 45s, 500 more iterations: 3h 13m 46s. [2025-11-13 10:28:47,304][__main__][INFO] - Starting iteration 388. [2025-11-13 10:28:47,307][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 38 and human policies 1. 
[2025-11-13 10:28:47,307][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:28:56,370][__main__][INFO] - Number of regex retries in iteration 388: 0 [2025-11-13 10:28:56,370][__main__][INFO] - agents played in iteration 388 are Alice_buffer, Alice, Bob, Bob_buffer [2025-11-13 10:28:56,795][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:28:56,831][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:28:56,864][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:28:56,897][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:28:56,898][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:28:56,899][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:28:57,583][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:28:57,879][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:28:58,205][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:28:58,533][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:28:58,859][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:28:59,186][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:28:59,515][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:28:59,851][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:29:00,181][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:29:00,509][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:29:00,837][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:29:01,163][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:29:01,490][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:29:01,817][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:29:02,144][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:29:02,471][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:29:02,798][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:29:03,124][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:29:03,451][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:29:03,778][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 
[2025-11-13 10:29:04,105][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:29:04,432][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:29:04,758][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:29:05,085][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:29:05,412][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:29:05,739][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:29:06,065][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:29:06,392][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:29:06,717][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:29:07,050][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:29:07,376][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:29:07,706][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:29:08,034][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 10:29:08,755][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:29:09,472][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:29:09,473][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:29:09,475][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:29:10,371][__main__][INFO] - Iteration 389 took 23s (39.29% Gen, 56.82% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 47m 53s. Estimated total time: 19h 13m 14s. Time estimates for 10 more iterations: 3m 50s, 100 more iterations: 38m 26s, 500 more iterations: 3h 12m 12s. [2025-11-13 10:29:10,373][__main__][INFO] - Starting iteration 389. [2025-11-13 10:29:10,375][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 38 and human policies 1. 
[2025-11-13 10:29:10,376][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:29:19,542][__main__][INFO] - Number of regex retries in iteration 389: 0 [2025-11-13 10:29:19,543][__main__][INFO] - agents played in iteration 389 are Alice_buffer, Alice, Bob, Bob_buffer [2025-11-13 10:29:19,983][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:29:20,019][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:29:20,053][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:29:20,087][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:29:20,087][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:29:20,088][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:29:20,775][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:29:21,070][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:29:21,397][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:29:21,723][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:29:22,050][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:29:22,377][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:29:22,704][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:29:23,033][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:29:23,358][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:29:23,684][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:29:24,017][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:29:24,343][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:29:24,675][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:29:25,005][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:29:25,334][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:29:25,661][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:29:25,987][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:29:26,313][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:29:26,641][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:29:26,970][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 
[2025-11-13 10:29:27,296][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:29:27,623][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:29:27,951][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:29:28,277][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:29:28,603][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:29:28,929][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:29:29,256][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:29:29,583][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:29:29,909][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:29:30,235][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:29:30,565][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:29:30,891][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:29:31,218][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 10:29:31,936][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:29:32,671][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:29:32,673][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:29:32,675][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:29:33,647][__main__][INFO] - Iteration 390 took 23s (39.39% Gen, 56.43% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 57m 52s. Estimated total time: 19h 23m 37s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 47s, 500 more iterations: 3h 13m 56s. [2025-11-13 10:29:33,649][__main__][INFO] - Starting iteration 390. [2025-11-13 10:29:33,652][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 38 and human policies 1. 
[2025-11-13 10:29:33,652][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:29:43,049][__main__][INFO] - Number of regex retries in iteration 390: 0 [2025-11-13 10:29:43,049][__main__][INFO] - agents played in iteration 390 are Alice_buffer, Alice, Bob, Bob_buffer [2025-11-13 10:29:43,489][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:29:43,523][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:29:43,557][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:29:43,590][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:29:43,591][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:29:43,591][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:29:44,312][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:29:44,609][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:29:44,936][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:29:45,264][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:29:45,589][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:29:45,914][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:29:46,241][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:29:46,575][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:29:46,902][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:29:47,230][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:29:47,559][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:29:47,892][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:29:48,218][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:29:48,545][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:29:48,873][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:29:49,200][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:29:49,530][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:29:49,857][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:29:50,184][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:29:50,511][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:29:50,838][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 10:29:51,166][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:29:51,493][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:29:51,820][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:29:52,146][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:29:52,474][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:29:52,801][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:29:53,128][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:29:53,455][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:29:53,781][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:29:54,108][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:29:54,435][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:29:54,762][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 10:29:55,479][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:29:56,210][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:29:56,212][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:29:56,213][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:29:58,168][__main__][INFO] - Iteration 391 took 24s (38.33% Gen, 53.69% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 59m 42s. Estimated total time: 20h 25m 51s. Time estimates for 10 more iterations: 4m 5s, 100 more iterations: 40m 51s, 500 more iterations: 3h 24m 18s. [2025-11-13 10:29:58,170][__main__][INFO] - Starting iteration 391. [2025-11-13 10:29:58,173][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 39 and human policies 1. 
[2025-11-13 10:29:58,174][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:30:07,640][__main__][INFO] - Number of regex retries in iteration 391: 0 [2025-11-13 10:30:07,641][__main__][INFO] - agents played in iteration 391 are Alice_buffer, Alice, Bob, Bob_buffer [2025-11-13 10:30:08,082][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:30:08,118][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:30:08,153][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:30:08,186][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:30:08,187][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:30:08,187][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:30:08,936][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:30:09,234][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:30:09,563][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:30:09,894][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:30:10,219][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:30:10,546][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:30:10,873][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:30:11,201][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:30:11,528][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:30:11,855][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:30:12,182][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:30:12,515][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:30:12,847][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:30:13,179][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:30:13,507][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:30:13,834][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:30:14,161][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:30:14,488][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:30:14,814][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:30:15,140][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:30:15,467][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 10:30:15,794][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:30:16,120][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:30:16,447][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:30:16,773][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:30:17,100][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:30:17,427][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:30:17,753][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:30:18,080][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:30:18,407][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:30:18,734][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:30:19,059][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:30:19,388][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 10:30:20,123][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:30:20,881][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:30:20,883][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:30:20,885][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:30:21,910][__main__][INFO] - Iteration 392 took 23s (39.88% Gen, 55.79% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 20m 20s. Estimated total time: 19h 46m 53s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 33s, 500 more iterations: 3h 17m 48s. [2025-11-13 10:30:21,912][__main__][INFO] - Starting iteration 392. [2025-11-13 10:30:21,915][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 39 and human policies 1. 
[2025-11-13 10:30:21,915][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:30:30,775][__main__][INFO] - Number of regex retries in iteration 392: 0 [2025-11-13 10:30:30,776][__main__][INFO] - agents played in iteration 392 are Alice_buffer, Alice, Bob, Bob_buffer [2025-11-13 10:30:31,215][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:30:31,250][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:30:31,283][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:30:31,316][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:30:31,316][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:30:31,317][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:30:32,037][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:30:32,333][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:30:32,660][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:30:32,988][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:30:33,318][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:30:33,643][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:30:33,970][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:30:34,296][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:30:34,624][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:30:34,950][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:30:35,277][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:30:35,604][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:30:35,932][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:30:36,261][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:30:36,590][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:30:36,917][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:30:37,244][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:30:37,571][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:30:37,899][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:30:38,225][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:30:38,551][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 10:30:38,878][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:30:39,206][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:30:39,532][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:30:39,859][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:30:40,186][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:30:40,512][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:30:40,838][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:30:41,165][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:30:41,492][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:30:41,818][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:30:42,147][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:30:42,473][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 10:30:43,194][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:30:43,944][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:30:43,946][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:30:43,947][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:30:45,107][__main__][INFO] - Iteration 393 took 23s (38.20% Gen, 56.79% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 52m 44s. Estimated total time: 19h 19m 40s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 39s, 500 more iterations: 3h 13m 16s. [2025-11-13 10:30:45,110][__main__][INFO] - Starting iteration 393. [2025-11-13 10:30:45,113][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 39 and human policies 1. 
[2025-11-13 10:30:45,114][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:30:54,911][__main__][INFO] - Number of regex retries in iteration 393: 0 [2025-11-13 10:30:54,912][__main__][INFO] - agents played in iteration 393 are Alice_buffer, Alice, Bob, Bob_buffer [2025-11-13 10:30:55,360][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:30:55,393][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:30:55,427][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:30:55,461][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:30:55,461][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:30:55,462][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:30:56,168][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:30:56,464][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:30:56,790][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:30:57,120][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:30:57,446][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:30:57,777][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:30:58,109][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:30:58,441][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:30:58,772][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:30:59,101][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:30:59,426][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:30:59,756][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:31:00,083][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:31:00,414][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:31:00,740][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:31:01,068][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:31:01,398][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:31:01,724][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:31:02,050][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:31:02,376][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:31:02,702][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 10:31:03,029][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:31:03,356][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:31:03,682][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:31:04,010][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:31:04,337][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:31:04,664][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:31:04,991][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:31:05,318][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:31:05,646][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:31:05,972][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:31:06,298][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:31:06,626][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 10:31:07,341][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:31:08,082][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:31:08,084][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:31:08,085][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:31:09,015][__main__][INFO] - Iteration 394 took 23s (40.99% Gen, 55.12% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 27m 47s. Estimated total time: 19h 55m 7s. Time estimates for 10 more iterations: 3m 59s, 100 more iterations: 39m 50s, 500 more iterations: 3h 19m 11s. [2025-11-13 10:31:09,017][__main__][INFO] - Starting iteration 394. [2025-11-13 10:31:09,020][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 39 and human policies 1. 
[2025-11-13 10:31:09,021][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:31:18,557][__main__][INFO] - Number of regex retries in iteration 394: 0 [2025-11-13 10:31:18,557][__main__][INFO] - agents played in iteration 394 are Alice_buffer, Alice, Bob, Bob_buffer [2025-11-13 10:31:18,986][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:31:19,020][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:31:19,055][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:31:19,090][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:31:19,090][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:31:19,091][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:31:19,824][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:31:20,119][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:31:20,447][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:31:20,775][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:31:21,102][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:31:21,429][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:31:21,758][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:31:22,085][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:31:22,412][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:31:22,738][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:31:23,066][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:31:23,395][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:31:23,722][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:31:24,049][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:31:24,376][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:31:24,702][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:31:25,030][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:31:25,356][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:31:25,682][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:31:26,010][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:31:26,337][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 10:31:26,664][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:31:26,992][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:31:27,319][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:31:27,647][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:31:27,974][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:31:28,301][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:31:28,629][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:31:28,957][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:31:29,285][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:31:29,618][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:31:29,951][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:31:30,285][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 10:31:31,017][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:31:31,746][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:31:31,748][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:31:31,750][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:31:32,863][__main__][INFO] - Iteration 395 took 23s (39.99% Gen, 55.33% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 24m 27s. Estimated total time: 19h 52m 10s. Time estimates for 10 more iterations: 3m 58s, 100 more iterations: 39m 44s, 500 more iterations: 3h 18m 41s. [2025-11-13 10:31:32,865][__main__][INFO] - Starting iteration 395. [2025-11-13 10:31:32,868][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 39 and human policies 1. 
[2025-11-13 10:31:32,869][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:31:42,453][__main__][INFO] - Number of regex retries in iteration 395: 0 [2025-11-13 10:31:42,454][__main__][INFO] - agents played in iteration 395 are Alice_buffer, Alice, Bob, Bob_buffer [2025-11-13 10:31:42,890][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:31:42,923][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:31:42,956][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:31:42,990][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:31:42,991][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:31:42,991][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:31:43,716][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:31:44,011][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:31:44,340][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:31:44,666][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:31:44,997][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:31:45,329][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:31:45,658][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:31:45,985][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:31:46,316][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:31:46,650][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:31:46,979][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:31:47,307][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:31:47,634][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:31:47,960][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:31:48,288][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:31:48,616][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:31:48,943][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:31:49,270][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:31:49,597][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:31:49,924][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:31:50,250][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:31:50,579][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:31:50,906][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:31:51,232][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:31:51,559][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:31:51,885][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:31:52,210][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:31:52,537][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:31:52,863][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:31:53,193][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:31:53,523][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:31:53,851][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:31:54,179][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:31:54,904][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:31:55,645][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:31:55,647][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:31:55,648][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:31:56,867][__main__][INFO] - Iteration 396 took 23s (39.94% Gen, 54.98% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 31m 51s. Estimated total time: 19h 59m 59s. Time estimates for 10 more iterations: 3m 59s, 100 more iterations: 39m 59s, 500 more iterations: 3h 19m 59s.
[2025-11-13 10:31:56,869][__main__][INFO] - Starting iteration 396.
[2025-11-13 10:31:56,874][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 39 and human policies 1.
[2025-11-13 10:31:56,875][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:32:06,099][__main__][INFO] - Number of regex retries in iteration 396: 0
[2025-11-13 10:32:06,100][__main__][INFO] - agents played in iteration 396 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 10:32:06,528][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:32:06,565][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:32:06,600][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:32:06,634][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:32:06,634][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:32:06,635][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:32:07,348][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:32:07,643][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:32:07,973][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:32:08,304][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:32:08,632][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:32:08,966][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:32:09,295][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:32:09,622][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:32:09,949][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:32:10,278][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:32:10,607][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:32:10,934][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:32:11,261][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:32:11,587][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:32:11,915][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:32:12,241][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:32:12,568][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:32:12,895][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:32:13,224][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:32:13,551][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:32:13,878][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:32:14,205][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:32:14,531][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:32:14,858][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:32:15,186][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:32:15,513][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:32:15,839][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:32:16,165][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:32:16,494][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:32:16,822][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:32:17,152][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:32:17,480][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:32:17,811][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:32:18,551][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:32:19,281][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:32:19,283][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:32:19,284][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:32:20,233][__main__][INFO] - Iteration 397 took 23s (39.49% Gen, 56.44% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 59m 29s. Estimated total time: 19h 28m 0s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 56s, 500 more iterations: 3h 14m 40s.
[2025-11-13 10:32:20,236][__main__][INFO] - Starting iteration 397.
[2025-11-13 10:32:20,238][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 39 and human policies 1.
[2025-11-13 10:32:20,239][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:32:30,147][__main__][INFO] - Number of regex retries in iteration 397: 0
[2025-11-13 10:32:30,147][__main__][INFO] - agents played in iteration 397 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 10:32:30,590][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:32:30,626][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:32:30,659][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:32:30,693][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:32:30,694][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:32:30,694][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:32:31,385][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:32:31,680][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:32:32,012][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:32:32,338][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:32:32,667][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:32:32,994][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:32:33,321][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:32:33,653][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:32:33,981][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:32:34,313][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:32:34,644][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:32:34,971][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:32:35,299][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:32:35,626][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:32:35,954][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:32:36,281][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:32:36,608][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:32:36,934][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:32:37,261][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:32:37,588][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:32:37,914][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:32:38,241][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:32:38,568][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:32:38,895][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:32:39,222][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:32:39,548][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:32:39,874][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:32:40,200][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:32:40,525][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:32:40,852][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:32:41,182][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:32:41,508][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:32:41,837][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:32:42,561][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:32:43,289][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:32:43,290][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:32:43,292][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:32:44,253][__main__][INFO] - Iteration 398 took 24s (41.26% Gen, 54.73% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 31m 52s. Estimated total time: 20h 0m 48s. Time estimates for 10 more iterations: 4m 0s, 100 more iterations: 40m 1s, 500 more iterations: 3h 20m 8s.
[2025-11-13 10:32:44,256][__main__][INFO] - Starting iteration 398.
[2025-11-13 10:32:44,259][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 39 and human policies 1.
[2025-11-13 10:32:44,260][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:32:52,923][__main__][INFO] - Number of regex retries in iteration 398: 0
[2025-11-13 10:32:52,924][__main__][INFO] - agents played in iteration 398 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 10:32:53,373][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:32:53,408][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:32:53,442][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:32:53,476][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:32:53,477][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:32:53,477][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:32:54,205][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:32:54,503][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:32:54,830][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:32:55,157][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:32:55,482][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:32:55,810][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:32:56,137][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:32:56,464][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:32:56,792][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:32:57,123][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:32:57,450][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:32:57,780][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:32:58,111][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:32:58,443][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:32:58,771][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:32:59,098][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:32:59,425][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:32:59,752][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:33:00,079][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:33:00,407][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:33:00,734][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:33:01,060][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:33:01,386][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:33:01,713][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:33:02,040][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:33:02,367][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:33:02,694][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:33:03,020][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:33:03,345][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:33:03,672][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:33:04,003][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:33:04,330][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:33:04,656][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:33:05,396][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:33:06,131][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:33:06,133][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:33:06,134][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:33:07,209][__main__][INFO] - Iteration 399 took 22s (37.75% Gen, 57.56% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 38m 15s. Estimated total time: 19h 7m 33s. Time estimates for 10 more iterations: 3m 49s, 100 more iterations: 38m 15s, 500 more iterations: 3h 11m 15s.
[2025-11-13 10:33:07,212][__main__][INFO] - Starting iteration 399.
[2025-11-13 10:33:07,215][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 39 and human policies 1.
[2025-11-13 10:33:07,216][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:33:16,982][__main__][INFO] - Number of regex retries in iteration 399: 0
[2025-11-13 10:33:16,982][__main__][INFO] - agents played in iteration 399 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 10:33:17,414][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:33:17,448][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:33:17,480][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:33:17,514][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:33:17,515][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:33:17,515][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:33:18,228][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:33:18,524][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:33:18,852][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:33:19,180][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:33:19,514][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:33:19,840][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:33:20,166][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:33:20,497][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:33:20,829][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:33:21,159][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:33:21,490][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:33:21,819][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:33:22,147][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:33:22,474][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:33:22,801][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:33:23,127][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:33:23,453][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:33:23,781][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:33:24,108][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:33:24,434][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:33:24,760][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:33:25,087][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:33:25,414][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:33:25,740][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:33:26,067][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:33:26,394][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:33:26,721][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:33:27,049][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:33:27,380][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:33:27,705][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:33:28,034][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:33:28,361][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:33:28,689][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:33:29,420][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:33:30,168][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:33:30,169][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:33:30,171][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:33:31,129][__main__][INFO] - Iteration 400 took 23s (40.84% Gen, 55.15% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 26m 2s. Estimated total time: 19h 55m 44s. Time estimates for 10 more iterations: 3m 59s, 100 more iterations: 39m 51s, 500 more iterations: 3h 19m 17s.
[2025-11-13 10:33:31,131][__main__][INFO] - Starting iteration 400.
[2025-11-13 10:33:31,134][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 39 and human policies 1.
[2025-11-13 10:33:31,135][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:33:39,804][__main__][INFO] - Number of regex retries in iteration 400: 0
[2025-11-13 10:33:39,805][__main__][INFO] - agents played in iteration 400 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 10:33:40,234][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:33:40,267][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:33:40,301][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:33:40,334][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:33:40,335][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:33:40,335][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:33:41,036][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:33:41,332][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:33:41,664][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:33:41,992][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:33:42,320][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:33:42,647][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:33:42,972][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:33:43,299][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:33:43,627][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:33:43,956][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:33:44,284][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:33:44,612][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:33:44,941][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:33:45,270][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:33:45,597][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:33:45,924][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:33:46,251][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:33:46,578][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:33:46,905][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:33:47,232][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:33:47,559][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:33:47,886][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:33:48,212][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:33:48,539][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:33:48,865][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:33:49,192][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:33:49,518][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:33:49,847][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:33:50,173][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:33:50,500][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:33:50,828][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:33:51,155][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:33:51,482][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:33:52,185][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:33:52,912][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:33:52,914][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:33:52,915][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:33:54,972][__main__][INFO] - Iteration 401 took 23s (36.37% Gen, 55.00% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 21m 49s. Estimated total time: 19h 51m 54s. Time estimates for 10 more iterations: 3m 58s, 100 more iterations: 39m 43s, 500 more iterations: 3h 18m 39s.
[2025-11-13 10:33:54,973][__main__][INFO] - Starting iteration 401.
[2025-11-13 10:33:54,977][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 40 and human policies 1.
[2025-11-13 10:33:54,977][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:34:04,392][__main__][INFO] - Number of regex retries in iteration 401: 0
[2025-11-13 10:34:04,392][__main__][INFO] - agents played in iteration 401 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 10:34:04,820][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:34:04,857][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:34:04,891][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:34:04,925][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:34:04,925][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:34:04,926][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:34:05,651][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:34:05,947][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:34:06,276][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:34:06,602][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:34:06,929][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:34:07,257][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:34:07,586][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:34:07,914][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:34:08,242][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:34:08,568][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:34:08,896][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:34:09,223][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:34:09,552][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:34:09,879][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:34:10,207][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:34:10,534][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:34:10,861][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:34:11,188][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:34:11,514][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:34:11,840][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:34:12,168][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:34:12,494][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:34:12,821][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:34:13,148][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:34:13,475][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:34:13,802][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:34:14,128][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:34:14,455][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:34:14,783][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:34:15,111][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:34:15,439][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:34:15,769][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:34:16,100][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:34:16,809][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:34:17,538][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:34:17,540][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:34:17,542][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:34:18,748][__main__][INFO] - Iteration 402 took 23s (39.61% Gen, 55.31% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 18m 7s. Estimated total time: 19h 48m 36s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 37s, 500 more iterations: 3h 18m 6s.
[2025-11-13 10:34:18,751][__main__][INFO] - Starting iteration 402.
[2025-11-13 10:34:18,754][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 40 and human policies 1.
[2025-11-13 10:34:18,754][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:34:27,855][__main__][INFO] - Number of regex retries in iteration 402: 0
[2025-11-13 10:34:27,856][__main__][INFO] - agents played in iteration 402 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 10:34:28,287][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:34:28,321][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:34:28,355][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:34:28,390][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:34:28,390][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:34:28,391][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:34:29,113][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:34:29,409][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:34:29,738][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:34:30,066][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:34:30,395][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:34:30,723][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:34:31,052][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:34:31,380][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:34:31,708][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:34:32,035][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:34:32,365][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:34:32,691][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:34:33,017][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:34:33,343][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:34:33,670][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:34:33,998][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:34:34,326][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:34:34,653][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:34:34,980][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:34:35,307][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:34:35,634][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:34:35,961][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:34:36,287][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:34:36,614][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:34:36,940][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:34:37,266][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:34:37,594][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:34:37,919][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:34:38,249][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:34:38,574][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:34:38,902][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:34:39,230][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:34:39,556][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:34:40,274][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:34:41,013][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:34:41,015][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:34:41,016][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:34:42,008][__main__][INFO] - Iteration 403 took 23s (39.14% Gen, 56.59% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 51m 51s. Estimated total time: 19h 22m 44s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 45s, 500 more iterations: 3h 13m 47s.
[2025-11-13 10:34:42,010][__main__][INFO] - Starting iteration 403.
[2025-11-13 10:34:42,013][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 40 and human policies 1.
[2025-11-13 10:34:42,014][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:34:51,170][__main__][INFO] - Number of regex retries in iteration 403: 0
[2025-11-13 10:34:51,171][__main__][INFO] - agents played in iteration 403 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 10:34:51,613][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:34:51,648][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:34:51,682][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:34:51,716][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:34:51,717][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:34:51,717][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:34:52,446][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:34:52,744][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:34:53,071][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:34:53,399][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:34:53,729][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:34:54,060][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:34:54,393][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:34:54,722][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:34:55,054][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:34:55,383][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:34:55,712][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:34:56,041][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:34:56,372][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:34:56,700][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:34:57,026][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:34:57,353][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:34:57,680][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:34:58,007][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:34:58,333][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:34:58,660][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:34:58,986][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:34:59,313][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:34:59,640][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:34:59,966][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:35:00,294][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:35:00,621][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:35:00,948][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:35:01,274][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:35:01,601][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:35:01,927][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:35:02,254][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:35:02,578][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:35:02,905][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:35:03,623][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:35:04,386][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:35:04,387][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:35:04,389][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:35:05,682][__main__][INFO] - Iteration 404 took 23s (38.69% Gen, 55.85% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 12m 11s. Estimated total time: 19h 43m 27s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 26s, 500 more iterations: 3h 17m 14s.
[2025-11-13 10:35:05,683][__main__][INFO] - Starting iteration 404.
[2025-11-13 10:35:05,687][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 40 and human policies 1.
[2025-11-13 10:35:05,687][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:35:14,238][__main__][INFO] - Number of regex retries in iteration 404: 0
[2025-11-13 10:35:14,239][__main__][INFO] - agents played in iteration 404 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 10:35:14,665][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:35:14,698][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:35:14,731][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:35:14,765][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:35:14,766][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:35:14,766][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:35:15,490][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:35:15,788][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:35:16,116][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:35:16,442][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:35:16,773][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:35:17,099][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:35:17,424][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:35:17,750][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:35:18,078][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:35:18,405][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:35:18,734][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:35:19,064][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:35:19,393][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:35:19,722][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:35:20,054][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:35:20,387][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:35:20,714][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:35:21,040][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:35:21,367][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:35:21,693][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:35:22,020][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:35:22,347][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:35:22,674][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:35:23,001][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:35:23,328][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:35:23,655][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:35:23,982][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:35:24,308][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:35:24,635][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:35:24,962][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:35:25,289][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:35:25,619][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:35:25,949][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:35:26,653][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:35:27,404][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:35:27,405][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:35:27,407][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:35:28,380][__main__][INFO] - Iteration 405 took 22s (37.68% Gen, 58.02% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 23m 2s. Estimated total time: 18h 54m 41s. Time estimates for 10 more iterations: 3m 46s, 100 more iterations: 37m 49s, 500 more iterations: 3h 9m 6s.
[2025-11-13 10:35:28,382][__main__][INFO] - Starting iteration 405.
[2025-11-13 10:35:28,385][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 40 and human policies 1.
[2025-11-13 10:35:28,386][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:35:37,820][__main__][INFO] - Number of regex retries in iteration 405: 0
[2025-11-13 10:35:37,821][__main__][INFO] - agents played in iteration 405 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 10:35:38,324][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:35:38,356][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:35:38,389][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:35:38,423][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:35:38,423][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:35:38,423][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:35:39,140][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:35:39,437][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:35:39,764][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:35:40,089][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:35:40,415][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:35:40,741][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:35:41,068][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:35:41,395][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:35:41,722][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:35:42,050][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:35:42,376][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:35:42,703][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:35:43,030][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:35:43,360][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:35:43,691][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:35:44,017][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:35:44,343][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:35:44,671][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:35:44,998][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:35:45,324][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:35:45,651][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:35:45,979][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:35:46,307][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:35:46,633][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:35:46,960][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:35:47,288][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:35:47,614][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:35:47,940][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:35:48,268][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:35:48,594][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:35:48,920][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:35:49,250][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:35:49,579][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:35:50,308][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:35:51,052][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:35:51,054][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:35:51,055][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:35:52,091][__main__][INFO] - Iteration 406 took 23s (39.80% Gen, 55.82% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 13m 17s. Estimated total time: 19h 45m 20s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 30s, 500 more iterations: 3h 17m 33s.
[2025-11-13 10:35:52,094][__main__][INFO] - Starting iteration 406.
[2025-11-13 10:35:52,097][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 40 and human policies 1.
[2025-11-13 10:35:52,098][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:36:01,402][__main__][INFO] - Number of regex retries in iteration 406: 0
[2025-11-13 10:36:01,403][__main__][INFO] - agents played in iteration 406 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 10:36:01,839][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:36:01,876][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:36:01,909][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:36:01,943][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:36:01,944][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:36:01,944][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:36:02,645][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:36:02,941][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:36:03,267][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:36:03,594][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:36:03,918][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:36:04,245][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:36:04,574][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:36:04,901][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:36:05,227][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:36:05,553][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:36:05,881][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:36:06,209][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:36:06,535][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:36:06,862][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:36:07,189][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:36:07,515][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:36:07,842][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:36:08,169][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:36:08,497][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:36:08,823][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:36:09,149][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:36:09,475][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:36:09,803][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:36:10,131][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:36:10,458][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:36:10,786][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:36:11,113][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:36:11,440][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:36:11,767][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:36:12,094][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:36:12,425][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:36:12,753][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:36:13,082][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:36:13,815][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:36:14,554][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:36:14,555][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:36:14,558][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:36:15,580][__main__][INFO] - Iteration 407 took 23s (39.62% Gen, 56.02% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 1m 43s. Estimated total time: 19h 34m 9s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 8s, 500 more iterations: 3h 15m 41s.
[2025-11-13 10:36:15,582][__main__][INFO] - Starting iteration 407.
[2025-11-13 10:36:15,585][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 40 and human policies 1.
[2025-11-13 10:36:15,585][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:36:25,002][__main__][INFO] - Number of regex retries in iteration 407: 0
[2025-11-13 10:36:25,003][__main__][INFO] - agents played in iteration 407 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 10:36:25,444][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:36:25,480][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:36:25,513][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:36:25,546][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:36:25,546][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:36:25,547][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:36:26,256][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:36:26,554][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:36:26,880][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:36:27,210][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:36:27,536][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:36:27,863][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:36:28,191][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:36:28,520][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:36:28,851][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:36:29,183][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:36:29,511][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:36:29,840][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:36:30,168][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:36:30,494][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:36:30,820][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:36:31,147][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:36:31,473][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:36:31,800][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:36:32,127][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:36:32,454][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:36:32,782][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:36:33,110][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:36:33,438][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:36:33,766][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:36:34,093][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:36:34,420][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:36:34,748][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:36:35,075][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:36:35,404][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:36:35,731][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:36:36,058][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:36:36,385][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:36:36,713][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:36:37,435][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:36:38,168][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:36:38,170][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:36:38,172][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:36:39,075][__main__][INFO] - Iteration 408 took 23s (40.09% Gen, 56.06% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 1m 45s. Estimated total time: 19h 34m 35s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 9s, 500 more iterations: 3h 15m 45s.
[2025-11-13 10:36:39,077][__main__][INFO] - Starting iteration 408.
[2025-11-13 10:36:39,080][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 40 and human policies 1.
[2025-11-13 10:36:39,081][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:36:48,210][__main__][INFO] - Number of regex retries in iteration 408: 0
[2025-11-13 10:36:48,211][__main__][INFO] - agents played in iteration 408 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 10:36:48,646][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:36:48,681][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:36:48,715][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:36:48,749][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:36:48,750][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:36:48,750][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:36:49,440][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:36:49,736][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:36:50,062][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:36:50,388][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:36:50,713][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:36:51,039][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:36:51,365][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:36:51,692][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:36:52,020][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:36:52,346][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:36:52,673][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:36:53,000][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:36:53,326][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:36:53,658][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:36:53,989][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:36:54,315][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:36:54,641][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:36:54,969][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:36:55,295][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:36:55,623][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:36:55,949][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:36:56,276][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:36:56,603][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:36:56,931][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:36:57,257][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:36:57,583][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:36:57,911][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:36:58,238][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:36:58,566][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:36:58,893][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:36:59,221][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:36:59,550][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:36:59,880][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:37:00,606][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:37:01,329][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:37:01,330][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:37:01,332][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:37:02,261][__main__][INFO] - Iteration 409 took 23s (39.38% Gen, 56.60% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 45m 52s. Estimated total time: 19h 19m 5s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 38s, 500 more iterations: 3h 13m 10s.
[2025-11-13 10:37:02,263][__main__][INFO] - Starting iteration 409.
[2025-11-13 10:37:02,266][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 40 and human policies 1.
[2025-11-13 10:37:02,267][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:37:10,783][__main__][INFO] - Number of regex retries in iteration 409: 0
[2025-11-13 10:37:10,783][__main__][INFO] - agents played in iteration 409 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 10:37:11,202][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:37:11,576][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:37:11,610][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:37:11,643][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:37:11,644][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:37:11,644][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:37:12,363][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:37:12,661][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:37:12,992][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:37:13,317][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:37:13,643][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:37:13,968][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:37:14,298][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:37:14,626][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:37:14,954][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:37:15,283][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:37:15,612][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:37:15,940][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:37:16,267][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:37:16,597][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:37:16,925][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:37:17,253][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:37:17,585][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:37:17,912][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:37:18,240][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:37:18,567][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:37:18,894][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:37:19,221][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:37:19,547][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:37:19,873][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:37:20,201][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:37:20,528][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:37:20,855][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:37:21,182][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:37:21,510][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:37:21,837][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:37:22,164][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:37:22,490][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:37:22,816][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:37:23,539][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:37:24,270][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:37:24,272][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:37:24,273][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:37:25,322][__main__][INFO] - Iteration 410 took 23s (36.94% Gen, 58.51% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 39m 15s. Estimated total time: 19h 12m 51s. Time estimates for 10 more iterations: 3m 50s, 100 more iterations: 38m 25s, 500 more iterations: 3h 12m 8s.
[2025-11-13 10:37:25,325][__main__][INFO] - Starting iteration 410.
[2025-11-13 10:37:25,328][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 40 and human policies 1.
[2025-11-13 10:37:25,329][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:37:34,487][__main__][INFO] - Number of regex retries in iteration 410: 0
[2025-11-13 10:37:34,488][__main__][INFO] - agents played in iteration 410 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 10:37:34,922][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:37:34,956][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:37:34,989][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:37:35,022][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:37:35,023][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:37:35,024][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:37:35,819][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:37:36,115][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:37:36,448][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:37:36,775][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:37:37,101][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:37:37,427][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:37:37,754][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:37:38,081][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:37:38,408][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:37:38,740][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:37:39,072][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:37:39,399][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:37:39,729][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:37:40,057][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:37:40,384][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:37:40,716][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:37:41,041][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:37:41,368][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:37:41,695][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:37:42,024][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:37:42,351][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:37:42,678][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:37:43,004][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:37:43,332][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:37:43,660][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:37:43,986][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:37:44,312][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:37:44,639][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:37:44,966][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:37:45,293][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:37:45,622][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:37:45,947][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:37:46,273][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:37:46,995][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:37:47,793][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:37:47,795][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:37:47,800][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:37:50,329][__main__][INFO] - Iteration 411 took 25s (36.63% Gen, 53.24% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 16m 4s. Estimated total time: 20h 50m 5s. Time estimates for 10 more iterations: 4m 10s, 100 more iterations: 41m 40s, 500 more iterations: 3h 28m 20s.
[2025-11-13 10:37:50,331][__main__][INFO] - Starting iteration 411.
[2025-11-13 10:37:50,335][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 41 and human policies 1.
[2025-11-13 10:37:50,335][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:38:00,250][__main__][INFO] - Number of regex retries in iteration 411: 0
[2025-11-13 10:38:00,251][__main__][INFO] - agents played in iteration 411 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 10:38:00,695][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:38:00,729][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:38:00,763][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:38:00,798][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:38:00,798][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:38:00,799][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:38:01,516][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:38:01,811][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:38:02,138][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:38:02,463][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:38:02,795][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:38:03,123][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:38:03,455][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:38:03,780][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:38:04,107][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:38:04,439][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:38:04,767][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:38:05,098][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:38:05,430][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:38:05,757][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:38:06,084][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:38:06,411][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:38:06,739][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:38:07,065][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:38:07,392][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:38:07,719][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:38:08,047][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:38:08,372][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:38:08,700][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:38:09,027][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:38:09,354][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:38:09,681][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:38:10,007][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:38:10,334][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:38:10,660][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:38:10,987][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:38:11,313][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:38:11,638][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:38:11,966][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:38:12,690][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:38:13,419][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:38:13,421][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:38:13,424][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:38:14,380][__main__][INFO] - Iteration 412 took 24s (41.24% Gen, 54.78% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 27m 53s. Estimated total time: 20h 2m 18s. Time estimates for 10 more iterations: 4m 0s, 100 more iterations: 40m 4s, 500 more iterations: 3h 20m 23s.
[2025-11-13 10:38:14,382][__main__][INFO] - Starting iteration 412.
[2025-11-13 10:38:14,385][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 41 and human policies 1.
[2025-11-13 10:38:14,386][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:38:23,329][__main__][INFO] - Number of regex retries in iteration 412: 0 [2025-11-13 10:38:23,330][__main__][INFO] - agents played in iteration 412 are Alice_buffer, Alice, Bob, Bob_buffer [2025-11-13 10:38:23,763][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:38:23,798][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:38:23,830][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:38:23,864][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:38:23,864][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:38:23,865][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:38:24,594][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:38:24,890][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:38:25,215][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:38:25,542][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:38:25,867][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:38:26,193][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:38:26,519][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:38:26,846][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:38:27,173][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:38:27,500][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:38:27,827][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:38:28,153][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:38:28,482][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:38:28,810][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:38:29,137][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:38:29,466][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:38:29,793][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:38:30,119][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:38:30,446][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:38:30,772][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:38:31,098][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:38:31,430][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:38:31,756][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:38:32,083][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:38:32,410][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:38:32,737][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:38:33,065][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:38:33,391][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:38:33,717][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:38:34,043][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:38:34,369][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:38:34,695][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:38:35,022][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:38:35,747][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:38:36,490][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:38:36,491][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:38:36,493][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:38:37,483][__main__][INFO] - Iteration 413 took 23s (38.72% Gen, 56.99% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 40m 8s. Estimated total time: 19h 14m 56s. Time estimates for 10 more iterations: 3m 50s, 100 more iterations: 38m 29s, 500 more iterations: 3h 12m 29s.
[2025-11-13 10:38:37,485][__main__][INFO] - Starting iteration 413.
[2025-11-13 10:38:37,488][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 41 and human policies 1.
[2025-11-13 10:38:37,488][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:38:47,107][__main__][INFO] - Number of regex retries in iteration 413: 0
[2025-11-13 10:38:47,108][__main__][INFO] - agents played in iteration 413 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 10:38:47,545][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:38:47,581][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:38:47,615][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:38:47,648][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:38:47,648][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:38:47,649][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:38:48,366][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:38:48,663][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:38:48,991][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:38:49,319][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:38:49,644][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:38:49,969][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:38:50,297][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:38:50,624][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:38:50,952][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:38:51,282][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:38:51,609][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:38:51,936][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:38:52,264][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:38:52,593][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:38:52,922][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:38:53,248][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:38:53,575][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:38:53,901][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:38:54,227][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:38:54,555][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:38:54,881][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:38:55,207][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:38:55,536][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:38:55,864][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:38:56,190][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:38:56,517][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:38:56,844][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:38:57,171][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:38:57,498][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:38:57,825][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:38:58,152][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:38:58,481][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:38:58,808][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:38:59,541][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:39:00,287][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:39:00,289][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:39:00,291][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:39:01,482][__main__][INFO] - Iteration 414 took 23s (40.09% Gen, 54.94% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 24m 32s. Estimated total time: 19h 59m 44s. Time estimates for 10 more iterations: 3m 59s, 100 more iterations: 39m 59s, 500 more iterations: 3h 19m 57s.
[2025-11-13 10:39:01,484][__main__][INFO] - Starting iteration 414.
[2025-11-13 10:39:01,487][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 41 and human policies 1.
[2025-11-13 10:39:01,487][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:39:10,563][__main__][INFO] - Number of regex retries in iteration 414: 0
[2025-11-13 10:39:10,564][__main__][INFO] - agents played in iteration 414 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 10:39:11,008][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:39:11,044][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:39:11,078][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:39:11,112][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:39:11,112][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:39:11,113][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:39:11,833][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:39:12,129][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:39:12,456][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:39:12,783][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:39:13,111][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:39:13,437][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:39:13,763][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:39:14,089][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:39:14,418][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:39:14,745][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:39:15,078][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:39:15,407][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:39:15,733][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:39:16,063][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:39:16,391][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:39:16,720][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:39:17,047][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:39:17,374][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:39:17,701][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:39:18,028][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:39:18,354][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:39:18,680][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:39:19,006][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:39:19,333][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:39:19,660][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:39:19,986][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:39:20,314][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:39:20,640][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:39:20,969][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:39:21,296][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:39:21,623][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:39:21,949][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:39:22,275][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:39:23,003][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:39:23,738][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:39:23,739][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:39:23,741][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:39:24,763][__main__][INFO] - Iteration 415 took 23s (38.99% Gen, 56.61% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 48m 16s. Estimated total time: 19h 23m 52s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 47s, 500 more iterations: 3h 13m 58s.
[2025-11-13 10:39:24,765][__main__][INFO] - Starting iteration 415.
[2025-11-13 10:39:24,768][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 41 and human policies 1.
[2025-11-13 10:39:24,769][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:39:33,580][__main__][INFO] - Number of regex retries in iteration 415: 0
[2025-11-13 10:39:33,581][__main__][INFO] - agents played in iteration 415 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 10:39:34,018][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:39:34,052][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:39:34,085][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:39:34,119][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:39:34,119][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:39:34,120][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:39:34,853][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:39:35,150][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:39:35,477][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:39:35,802][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:39:36,131][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:39:36,457][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:39:36,782][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:39:37,109][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:39:37,435][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:39:37,761][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:39:38,088][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:39:38,415][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:39:38,742][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:39:39,068][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:39:39,396][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:39:39,724][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:39:40,050][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:39:40,378][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:39:40,705][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:39:41,032][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:39:41,359][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:39:41,687][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:39:42,015][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:39:42,342][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:39:42,669][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:39:42,997][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:39:43,324][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:39:43,651][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:39:43,977][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:39:44,302][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:39:44,630][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:39:44,958][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:39:45,285][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:39:45,998][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:39:46,758][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:39:46,759][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:39:46,761][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:39:47,782][__main__][INFO] - Iteration 416 took 23s (38.29% Gen, 57.27% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 34m 45s. Estimated total time: 19h 10m 43s. Time estimates for 10 more iterations: 3m 50s, 100 more iterations: 38m 21s, 500 more iterations: 3h 11m 47s.
[2025-11-13 10:39:47,784][__main__][INFO] - Starting iteration 416.
[2025-11-13 10:39:47,788][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 41 and human policies 1.
[2025-11-13 10:39:47,789][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:39:57,592][__main__][INFO] - Number of regex retries in iteration 416: 0
[2025-11-13 10:39:57,593][__main__][INFO] - agents played in iteration 416 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 10:39:58,026][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:39:58,060][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:39:58,094][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:39:58,128][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:39:58,128][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:39:58,129][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:39:58,843][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:39:59,139][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:39:59,465][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:39:59,792][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:40:00,118][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:40:00,444][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:40:00,778][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:40:01,107][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:40:01,440][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:40:01,769][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:40:02,098][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:40:02,429][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:40:02,759][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:40:03,087][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:40:03,415][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:40:03,748][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:40:04,076][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:40:04,404][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:40:04,732][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:40:05,059][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:40:05,386][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:40:05,714][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:40:06,040][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:40:06,368][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:40:06,696][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:40:07,025][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:40:07,352][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:40:07,679][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:40:08,007][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:40:08,334][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:40:08,661][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:40:08,988][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:40:09,315][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:40:10,042][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:40:10,784][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:40:10,785][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:40:10,786][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:40:11,756][__main__][INFO] - Iteration 417 took 23s (40.90% Gen, 55.04% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 22m 6s. Estimated total time: 19h 58m 28s. Time estimates for 10 more iterations: 3m 59s, 100 more iterations: 39m 56s, 500 more iterations: 3h 19m 44s.
[2025-11-13 10:40:11,758][__main__][INFO] - Starting iteration 417.
[2025-11-13 10:40:11,762][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 41 and human policies 1.
[2025-11-13 10:40:11,762][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:40:21,046][__main__][INFO] - Number of regex retries in iteration 417: 0
[2025-11-13 10:40:21,048][__main__][INFO] - agents played in iteration 417 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 10:40:21,535][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:40:21,594][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:40:21,633][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:40:21,668][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:40:21,668][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:40:21,669][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:40:22,385][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:40:22,682][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:40:23,009][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:40:23,337][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:40:23,663][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:40:23,992][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:40:24,322][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:40:24,650][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:40:24,977][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:40:25,305][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:40:25,636][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:40:25,963][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:40:26,291][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:40:26,619][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:40:26,949][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:40:27,278][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:40:27,607][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:40:27,934][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:40:28,261][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:40:28,587][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:40:28,915][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:40:29,242][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:40:29,568][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:40:29,895][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:40:30,223][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:40:30,551][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:40:30,877][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:40:31,205][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:40:31,532][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:40:31,859][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:40:32,187][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:40:32,514][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:40:32,843][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:40:33,567][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:40:34,308][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:40:34,309][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:40:34,311][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:40:35,483][__main__][INFO] - Iteration 418 took 23s (39.14% Gen, 55.91% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 9m 23s. Estimated total time: 19h 46m 9s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 32s, 500 more iterations: 3h 17m 41s.
[2025-11-13 10:40:35,485][__main__][INFO] - Starting iteration 418.
[2025-11-13 10:40:35,488][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 41 and human policies 1.
[2025-11-13 10:40:35,489][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:40:44,485][__main__][INFO] - Number of regex retries in iteration 418: 0
[2025-11-13 10:40:44,486][__main__][INFO] - agents played in iteration 418 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 10:40:44,920][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:40:44,954][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:40:44,989][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:40:45,023][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:40:45,023][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:40:45,023][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:40:45,727][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:40:46,021][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:40:46,350][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:40:46,677][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:40:47,006][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:40:47,333][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:40:47,659][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:40:47,984][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:40:48,310][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:40:48,638][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:40:48,964][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:40:49,292][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:40:49,620][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:40:49,948][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:40:50,279][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:40:50,609][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:40:50,940][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:40:51,272][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:40:51,604][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:40:51,930][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:40:52,259][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:40:52,586][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:40:52,912][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:40:53,240][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:40:53,567][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:40:53,893][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:40:54,220][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:40:54,549][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:40:54,876][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:40:55,204][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:40:55,532][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:40:55,859][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:40:56,187][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:40:56,911][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:40:57,644][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:40:57,646][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:40:57,647][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:40:58,842][__main__][INFO] - Iteration 419 took 23s (38.52% Gen, 56.35% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 50m 34s. Estimated total time: 19h 27m 44s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 55s, 500 more iterations: 3h 14m 37s.
[2025-11-13 10:40:58,845][__main__][INFO] - Starting iteration 419.
[2025-11-13 10:40:58,847][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 41 and human policies 1.
[2025-11-13 10:40:58,848][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:41:07,839][__main__][INFO] - Number of regex retries in iteration 419: 0
[2025-11-13 10:41:07,840][__main__][INFO] - agents played in iteration 419 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 10:41:08,283][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:41:08,316][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:41:08,349][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:41:08,382][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:41:08,383][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:41:08,384][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:41:09,104][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:41:09,403][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:41:09,733][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:41:10,062][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:41:10,389][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:41:10,720][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:41:11,046][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:41:11,372][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:41:11,704][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:41:12,029][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:41:12,357][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:41:12,689][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:41:13,017][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:41:13,344][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:41:13,672][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:41:14,003][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:41:14,332][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:41:14,661][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:41:14,991][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:41:15,320][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:41:15,647][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:41:15,975][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:41:16,302][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:41:16,633][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:41:16,960][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:41:17,287][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:41:17,614][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:41:17,941][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:41:18,269][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:41:18,595][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:41:18,922][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:41:19,250][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:41:19,578][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:41:20,310][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:41:21,016][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:41:21,018][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:41:21,019][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:41:21,981][__main__][INFO] - Iteration 420 took 23s (38.87% Gen, 56.97% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 39m 11s. Estimated total time: 19h 16m 44s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 33s, 500 more iterations: 3h 12m 47s.
[2025-11-13 10:41:21,987][__main__][INFO] - Starting iteration 420.
[2025-11-13 10:41:21,990][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 41 and human policies 1.
[2025-11-13 10:41:21,990][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:41:31,713][__main__][INFO] - Number of regex retries in iteration 420: 0
[2025-11-13 10:41:31,713][__main__][INFO] - agents played in iteration 420 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 10:41:32,149][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:41:32,182][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:41:32,215][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:41:32,249][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:41:32,250][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:41:32,250][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:41:32,976][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:41:33,273][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:41:33,602][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:41:33,932][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:41:34,262][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:41:34,590][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:41:34,916][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:41:35,245][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:41:35,577][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:41:35,904][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:41:36,232][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:41:36,562][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:41:36,895][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:41:37,223][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:41:37,554][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:41:37,882][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:41:38,214][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:41:38,547][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:41:38,876][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:41:39,206][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:41:39,533][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:41:39,860][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:41:40,187][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:41:40,515][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:41:40,842][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:41:41,174][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:41:41,501][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:41:41,830][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:41:42,156][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:41:42,484][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:41:42,811][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:41:43,138][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:41:43,466][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:41:44,205][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:41:44,943][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:41:44,944][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:41:44,946][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:41:47,144][__main__][INFO] - Iteration 421 took 25s (38.65% Gen, 52.61% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 19m 46s. Estimated total time: 20h 57m 44s. Time estimates for 10 more iterations: 4m 11s, 100 more iterations: 41m 55s, 500 more iterations: 3h 29m 37s.
[2025-11-13 10:41:47,147][__main__][INFO] - Starting iteration 421.
[2025-11-13 10:41:47,149][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 42 and human policies 1.
[2025-11-13 10:41:47,150][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:41:56,648][__main__][INFO] - Number of regex retries in iteration 421: 0
[2025-11-13 10:41:56,649][__main__][INFO] - agents played in iteration 421 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 10:41:57,080][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:41:57,114][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:41:57,147][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:41:57,182][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:41:57,182][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:41:57,183][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:41:57,903][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:41:58,199][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:41:58,527][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:41:58,853][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:41:59,178][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:41:59,505][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:41:59,831][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:42:00,157][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:42:00,483][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:42:00,810][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:42:01,139][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:42:01,470][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:42:01,799][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:42:02,133][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:42:02,461][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:42:02,788][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:42:03,115][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:42:03,442][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:42:03,770][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:42:04,097][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:42:04,423][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:42:04,750][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:42:05,077][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:42:05,404][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:42:05,731][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:42:06,058][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:42:06,386][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:42:06,713][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:42:07,040][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:42:07,368][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:42:07,695][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:42:08,022][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:42:08,350][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:42:09,084][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:42:09,838][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:42:09,839][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:42:09,841][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:42:10,797][__main__][INFO] - Iteration 422 took 23s (40.16% Gen, 55.79% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 4m 3s. Estimated total time: 19h 42m 24s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 24s, 500 more iterations: 3h 17m 4s.
[2025-11-13 10:42:10,799][__main__][INFO] - Starting iteration 422.
[2025-11-13 10:42:10,802][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 42 and human policies 1.
[2025-11-13 10:42:10,802][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:42:20,632][__main__][INFO] - Number of regex retries in iteration 422: 0
[2025-11-13 10:42:20,633][__main__][INFO] - agents played in iteration 422 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 10:42:21,077][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:42:21,111][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:42:21,145][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:42:21,179][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:42:21,180][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:42:21,180][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:42:21,881][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:42:22,177][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:42:22,504][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:42:22,830][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:42:23,157][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:42:23,483][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:42:23,811][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:42:24,137][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:42:24,467][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:42:24,792][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:42:25,119][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:42:25,447][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:42:25,774][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:42:26,108][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:42:26,437][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:42:26,763][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:42:27,090][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:42:27,417][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:42:27,744][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:42:28,071][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:42:28,397][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:42:28,724][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:42:29,052][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:42:29,380][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:42:29,707][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:42:30,034][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:42:30,361][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:42:30,689][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:42:31,016][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:42:31,343][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:42:31,670][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:42:32,000][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:42:32,328][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:42:33,068][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:42:33,808][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:42:33,810][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:42:33,812][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:42:34,756][__main__][INFO] - Iteration 423 took 23s (41.04% Gen, 55.01% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 18m 59s. Estimated total time: 19h 57m 45s. Time estimates for 10 more iterations: 3m 59s, 100 more iterations: 39m 55s, 500 more iterations: 3h 19m 37s.
[2025-11-13 10:42:34,758][__main__][INFO] - Starting iteration 423.
[2025-11-13 10:42:34,760][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 42 and human policies 1.
[2025-11-13 10:42:34,761][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:42:43,360][__main__][INFO] - Number of regex retries in iteration 423: 0
[2025-11-13 10:42:43,361][__main__][INFO] - agents played in iteration 423 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 10:42:43,789][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:42:43,825][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:42:43,858][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:42:43,891][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:42:43,892][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:42:43,892][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:42:44,611][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:42:44,906][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:42:45,235][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:42:45,563][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:42:45,889][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:42:46,214][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:42:46,541][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:42:46,865][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:42:47,190][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:42:47,518][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:42:47,845][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:42:48,171][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:42:48,497][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:42:48,826][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:42:49,155][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:42:49,484][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:42:49,813][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:42:50,140][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:42:50,467][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:42:50,793][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:42:51,120][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:42:51,448][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:42:51,776][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:42:52,103][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:42:52,430][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:42:52,757][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:42:53,083][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:42:53,414][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:42:53,740][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:42:54,066][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:42:54,393][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:42:54,720][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:42:55,048][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:42:55,777][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:42:56,511][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:42:56,512][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:42:56,514][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:42:57,586][__main__][INFO] - Iteration 424 took 22s (37.67% Gen, 57.62% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 22m 11s. Estimated total time: 19h 1m 19s. Time estimates for 10 more iterations: 3m 48s, 100 more iterations: 38m 2s, 500 more iterations: 3h 10m 13s.
[2025-11-13 10:42:57,588][__main__][INFO] - Starting iteration 424.
[2025-11-13 10:42:57,591][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 42 and human policies 1.
[2025-11-13 10:42:57,592][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:43:07,028][__main__][INFO] - Number of regex retries in iteration 424: 0
[2025-11-13 10:43:07,029][__main__][INFO] - agents played in iteration 424 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 10:43:07,458][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:43:07,495][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:43:07,529][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:43:07,563][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:43:07,564][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:43:07,565][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:43:08,309][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:43:08,606][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:43:08,933][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:43:09,259][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:43:09,587][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:43:09,915][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:43:10,243][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:43:10,571][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:43:10,902][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:43:11,231][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:43:11,562][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:43:11,898][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:43:12,227][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:43:12,557][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:43:12,884][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:43:13,213][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:43:13,540][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:43:13,868][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:43:14,195][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:43:14,521][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:43:14,847][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:43:15,174][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:43:15,502][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:43:15,828][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:43:16,155][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:43:16,481][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:43:16,807][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:43:17,135][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:43:17,463][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:43:17,790][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:43:18,117][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:43:18,443][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:43:18,770][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:43:19,489][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:43:20,216][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:43:20,217][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:43:20,219][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:43:21,220][__main__][INFO] - Iteration 425 took 23s (39.94% Gen, 55.82% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 1m 55s. Estimated total time: 19h 41m 27s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 22s, 500 more iterations: 3h 16m 54s.
[2025-11-13 10:43:21,222][__main__][INFO] - Starting iteration 425.
[2025-11-13 10:43:21,226][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 42 and human policies 1.
[2025-11-13 10:43:21,226][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:43:30,413][__main__][INFO] - Number of regex retries in iteration 425: 0
[2025-11-13 10:43:30,414][__main__][INFO] - agents played in iteration 425 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 10:43:30,846][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:43:30,881][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:43:30,914][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:43:30,948][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:43:30,949][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:43:30,949][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:43:31,682][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:43:31,979][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:43:32,306][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:43:32,634][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:43:32,961][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:43:33,287][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:43:33,612][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:43:33,939][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:43:34,268][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:43:34,601][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:43:34,934][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:43:35,263][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:43:35,596][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:43:35,923][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:43:36,250][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:43:36,577][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:43:36,906][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:43:37,233][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:43:37,560][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:43:37,886][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:43:38,213][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:43:38,540][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:43:38,867][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:43:39,194][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:43:39,520][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:43:39,847][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:43:40,175][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:43:40,502][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:43:40,828][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:43:41,155][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:43:41,482][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:43:41,810][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:43:42,137][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:43:42,855][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:43:43,593][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:43:43,599][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:43:43,601][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:43:44,605][__main__][INFO] - Iteration 426 took 23s (39.30% Gen, 56.40% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 49m 5s. Estimated total time: 19h 29m 0s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 58s, 500 more iterations: 3h 14m 50s.
[2025-11-13 10:43:44,608][__main__][INFO] - Starting iteration 426.
[2025-11-13 10:43:44,612][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 42 and human policies 1.
[2025-11-13 10:43:44,612][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:43:54,315][__main__][INFO] - Number of regex retries in iteration 426: 0
[2025-11-13 10:43:54,316][__main__][INFO] - agents played in iteration 426 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 10:43:54,755][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:43:54,789][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:43:54,823][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:43:54,857][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:43:54,858][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:43:54,859][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:43:55,595][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:43:55,895][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:43:56,223][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:43:56,549][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:43:56,878][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:43:57,202][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:43:57,529][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:43:57,858][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:43:58,187][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:43:58,515][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:43:58,844][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:43:59,171][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:43:59,498][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:43:59,827][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:44:00,158][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:44:00,488][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:44:00,815][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:44:01,142][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:44:01,468][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:44:01,795][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:44:02,121][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:44:02,448][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:44:02,776][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:44:03,103][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:44:03,430][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:44:03,757][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:44:04,084][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:44:04,411][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:44:04,736][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:44:05,064][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:44:05,390][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:44:05,718][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:44:06,048][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:44:06,796][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:44:07,552][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:44:07,555][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:44:07,557][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:44:08,560][__main__][INFO] - Iteration 427 took 23s (40.52% Gen, 55.29% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 17m 8s. Estimated total time: 19h 57m 28s. Time estimates for 10 more iterations: 3m 59s, 100 more iterations: 39m 54s, 500 more iterations: 3h 19m 34s.
[2025-11-13 10:44:08,562][__main__][INFO] - Starting iteration 427.
[2025-11-13 10:44:08,566][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 42 and human policies 1.
[2025-11-13 10:44:08,566][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:44:17,781][__main__][INFO] - Number of regex retries in iteration 427: 0
[2025-11-13 10:44:17,782][__main__][INFO] - agents played in iteration 427 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 10:44:18,236][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:44:18,269][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:44:18,303][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:44:18,337][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:44:18,338][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:44:18,338][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:44:19,050][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:44:19,348][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:44:19,675][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:44:20,001][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:44:20,327][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:44:20,655][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:44:20,982][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:44:21,310][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:44:21,641][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:44:21,974][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:44:22,301][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:44:22,632][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:44:22,959][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:44:23,289][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:44:23,617][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:44:23,947][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:44:24,275][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:44:24,603][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:44:24,930][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:44:25,256][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:44:25,582][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:44:25,910][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:44:26,237][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:44:26,564][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:44:26,891][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:44:27,218][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:44:27,545][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:44:27,872][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:44:28,200][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:44:28,528][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:44:28,854][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:44:29,181][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:44:29,509][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:44:30,238][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:44:30,988][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:44:30,990][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:44:30,992][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:44:31,961][__main__][INFO] - Iteration 428 took 23s (39.39% Gen, 56.46% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 49m 6s. Estimated total time: 19h 29m 49s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 59s, 500 more iterations: 3h 14m 58s.
[2025-11-13 10:44:31,964][__main__][INFO] - Starting iteration 428.
[2025-11-13 10:44:31,967][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 42 and human policies 1.
[2025-11-13 10:44:31,967][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:44:41,125][__main__][INFO] - Number of regex retries in iteration 428: 0
[2025-11-13 10:44:41,126][__main__][INFO] - agents played in iteration 428 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 10:44:41,575][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:44:41,610][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:44:41,644][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:44:41,678][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:44:41,679][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:44:41,679][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:44:42,401][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:44:42,698][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:44:43,029][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:44:43,355][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:44:43,681][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:44:44,008][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:44:44,334][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:44:44,661][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:44:44,987][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:44:45,313][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:44:45,641][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:44:45,968][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:44:46,295][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:44:46,623][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:44:46,949][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:44:47,278][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:44:47,605][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:44:47,935][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:44:48,263][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:44:48,590][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:44:48,919][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:44:49,245][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:44:49,572][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:44:49,899][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:44:50,226][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:44:50,555][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:44:50,882][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:44:51,211][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:44:51,538][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:44:51,865][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:44:52,191][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:44:52,521][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:44:52,847][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:44:53,576][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:44:54,315][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:44:54,317][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:44:54,318][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:44:55,278][__main__][INFO] - Iteration 429 took 23s (39.28% Gen, 56.59% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 44m 30s. Estimated total time: 19h 25m 36s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 51s, 500 more iterations: 3h 14m 16s. [2025-11-13 10:44:55,280][__main__][INFO] - Starting iteration 429. [2025-11-13 10:44:55,283][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 42 and human policies 1. 
[2025-11-13 10:44:55,284][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:45:05,223][__main__][INFO] - Number of regex retries in iteration 429: 0
[2025-11-13 10:45:05,224][__main__][INFO] - agents played in iteration 429 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 10:45:05,666][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:45:05,699][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:45:05,733][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:45:05,767][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:45:05,768][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:45:05,768][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:45:06,481][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:45:06,777][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:45:07,104][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:45:07,430][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:45:07,756][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:45:08,082][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:45:08,409][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:45:08,737][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:45:09,064][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:45:09,392][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:45:09,720][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:45:10,048][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:45:10,376][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:45:10,703][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:45:11,031][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:45:11,358][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:45:11,686][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:45:12,013][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:45:12,340][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:45:12,668][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:45:12,996][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:45:13,322][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:45:13,650][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:45:13,977][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:45:14,304][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:45:14,632][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:45:14,959][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:45:15,286][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:45:15,614][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:45:15,941][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:45:16,268][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:45:16,596][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:45:16,924][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:45:17,660][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:45:18,381][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:45:18,383][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:45:18,384][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:45:19,338][__main__][INFO] - Iteration 430 took 24s (41.32% Gen, 54.71% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 21m 16s. Estimated total time: 20h 2m 46s. Time estimates for 10 more iterations: 4m 0s, 100 more iterations: 40m 5s, 500 more iterations: 3h 20m 27s.
[2025-11-13 10:45:19,340][__main__][INFO] - Starting iteration 430.
[2025-11-13 10:45:19,343][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 42 and human policies 1.
[2025-11-13 10:45:19,344][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:45:28,541][__main__][INFO] - Number of regex retries in iteration 430: 0
[2025-11-13 10:45:28,542][__main__][INFO] - agents played in iteration 430 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 10:45:28,983][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:45:29,018][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:45:29,052][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:45:29,086][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:45:29,086][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:45:29,087][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:45:29,818][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:45:30,115][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:45:30,446][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:45:30,777][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:45:31,103][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:45:31,429][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:45:31,755][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:45:32,082][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:45:32,409][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:45:32,734][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:45:33,060][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:45:33,388][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:45:33,715][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:45:34,043][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:45:34,370][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:45:34,697][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:45:35,024][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:45:35,351][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:45:35,677][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:45:36,004][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:45:36,331][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:45:36,658][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:45:36,984][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:45:37,312][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:45:37,640][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:45:37,967][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:45:38,294][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:45:38,621][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:45:38,949][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:45:39,276][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:45:39,603][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:45:39,930][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:45:40,257][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:45:41,043][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:45:41,754][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:45:41,756][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:45:41,758][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:45:43,710][__main__][INFO] - Iteration 431 took 24s (37.75% Gen, 54.24% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 36m 29s. Estimated total time: 20h 18m 24s. Time estimates for 10 more iterations: 4m 3s, 100 more iterations: 40m 36s, 500 more iterations: 3h 23m 4s.
[2025-11-13 10:45:43,712][__main__][INFO] - Starting iteration 431.
[2025-11-13 10:45:43,732][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 43 and human policies 1.
[2025-11-13 10:45:43,732][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:45:53,378][__main__][INFO] - Number of regex retries in iteration 431: 0
[2025-11-13 10:45:53,379][__main__][INFO] - agents played in iteration 431 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 10:45:53,814][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:45:53,847][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:45:53,880][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:45:53,913][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:45:53,914][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:45:53,914][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:45:54,604][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:45:54,900][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:45:55,228][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:45:55,554][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:45:55,880][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:45:56,206][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:45:56,532][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:45:56,861][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:45:57,188][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:45:57,515][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:45:57,842][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:45:58,168][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:45:58,497][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:45:58,825][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:45:59,152][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:45:59,479][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:45:59,806][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:46:00,136][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:46:00,464][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:46:00,791][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:46:01,117][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:46:01,445][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:46:01,772][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:46:02,099][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:46:02,425][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:46:02,751][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:46:03,078][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:46:03,405][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:46:03,732][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:46:04,058][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:46:04,386][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:46:04,712][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:46:05,039][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:46:05,766][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:46:06,525][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:46:06,526][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:46:06,528][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:46:07,422][__main__][INFO] - Iteration 432 took 23s (40.69% Gen, 55.46% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 3m 6s. Estimated total time: 19h 45m 24s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 30s, 500 more iterations: 3h 17m 34s.
[2025-11-13 10:46:07,424][__main__][INFO] - Starting iteration 432.
[2025-11-13 10:46:07,428][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 43 and human policies 1.
[2025-11-13 10:46:07,428][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:46:16,467][__main__][INFO] - Number of regex retries in iteration 432: 0
[2025-11-13 10:46:16,468][__main__][INFO] - agents played in iteration 432 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 10:46:16,906][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:46:16,942][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:46:16,976][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:46:17,010][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:46:17,011][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:46:17,011][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:46:17,708][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:46:18,004][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:46:18,330][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:46:18,656][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:46:18,985][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:46:19,310][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:46:19,636][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:46:19,962][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:46:20,288][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:46:20,614][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:46:20,941][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:46:21,269][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:46:21,597][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:46:21,926][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:46:22,254][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:46:22,582][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:46:22,913][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:46:23,246][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:46:23,574][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:46:23,901][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:46:24,228][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:46:24,555][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:46:24,882][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:46:25,209][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:46:25,537][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:46:25,864][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:46:26,191][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:46:26,518][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:46:26,845][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:46:27,172][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:46:27,499][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:46:27,827][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:46:28,154][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:46:28,909][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:46:29,648][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:46:29,650][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:46:29,652][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:46:30,635][__main__][INFO] - Iteration 433 took 23s (38.95% Gen, 56.81% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 37m 43s. Estimated total time: 19h 20m 25s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 40s, 500 more iterations: 3h 13m 24s.
[2025-11-13 10:46:30,638][__main__][INFO] - Starting iteration 433.
[2025-11-13 10:46:30,642][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 43 and human policies 1.
[2025-11-13 10:46:30,642][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:46:40,128][__main__][INFO] - Number of regex retries in iteration 433: 0
[2025-11-13 10:46:40,129][__main__][INFO] - agents played in iteration 433 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 10:46:40,566][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:46:40,600][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:46:40,633][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:46:40,666][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:46:40,667][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:46:40,667][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:46:41,385][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:46:41,681][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:46:42,008][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:46:42,332][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:46:42,660][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:46:42,987][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:46:43,313][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:46:43,639][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:46:43,965][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:46:44,290][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:46:44,618][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:46:44,946][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:46:45,275][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:46:45,606][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:46:45,935][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:46:46,263][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:46:46,589][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:46:46,916][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:46:47,243][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:46:47,571][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:46:47,899][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:46:48,225][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:46:48,553][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:46:48,880][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:46:49,207][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:46:49,533][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:46:49,861][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:46:50,187][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:46:50,515][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:46:50,841][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:46:51,168][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:46:51,495][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:46:51,822][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:46:52,551][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:46:53,280][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:46:53,281][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:46:53,282][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:46:54,237][__main__][INFO] - Iteration 434 took 23s (40.20% Gen, 55.75% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 56m 42s. Estimated total time: 19h 39m 47s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 19s, 500 more iterations: 3h 16m 37s.
[2025-11-13 10:46:54,239][__main__][INFO] - Starting iteration 434.
[2025-11-13 10:46:54,242][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 43 and human policies 1.
[2025-11-13 10:46:54,243][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:47:03,345][__main__][INFO] - Number of regex retries in iteration 434: 0
[2025-11-13 10:47:03,346][__main__][INFO] - agents played in iteration 434 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 10:47:03,784][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:47:03,818][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:47:03,851][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:47:03,884][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:47:03,884][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:47:03,885][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:47:04,605][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:47:04,899][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:47:05,225][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:47:05,552][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:47:05,879][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:47:06,206][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:47:06,532][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:47:06,857][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:47:07,183][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:47:07,510][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:47:07,839][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:47:08,166][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:47:08,495][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:47:08,823][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:47:09,153][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:47:09,481][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:47:09,809][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:47:10,137][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:47:10,465][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:47:10,792][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:47:11,118][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:47:11,445][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:47:11,771][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:47:12,097][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:47:12,423][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:47:12,751][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:47:13,077][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:47:13,404][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:47:13,732][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:47:14,059][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:47:14,387][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:47:14,713][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:47:15,038][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:47:15,779][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:47:16,518][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:47:16,519][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:47:16,521][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:47:17,583][__main__][INFO] - Iteration 435 took 23s (39.00% Gen, 56.44% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 43m 36s. Estimated total time: 19h 27m 4s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 54s, 500 more iterations: 3h 14m 30s.
[2025-11-13 10:47:17,585][__main__][INFO] - Starting iteration 435.
[2025-11-13 10:47:17,589][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 43 and human policies 1.
[2025-11-13 10:47:17,590][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:47:26,774][__main__][INFO] - Number of regex retries in iteration 435: 0
[2025-11-13 10:47:26,775][__main__][INFO] - agents played in iteration 435 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 10:47:27,211][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:47:27,244][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:47:27,277][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:47:27,311][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:47:27,311][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:47:27,311][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:47:28,059][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:47:28,354][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:47:28,683][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:47:29,009][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:47:29,335][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:47:29,660][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:47:29,985][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:47:30,312][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:47:30,638][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:47:30,966][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:47:31,293][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:47:31,620][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:47:31,947][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:47:32,275][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:47:32,603][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:47:32,931][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:47:33,260][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:47:33,589][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:47:33,916][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:47:34,243][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:47:34,570][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:47:34,897][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:47:35,222][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:47:35,550][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:47:35,877][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:47:36,205][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:47:36,532][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:47:36,859][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:47:37,186][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:47:37,515][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:47:37,843][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:47:38,170][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:47:38,498][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:47:39,238][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:47:39,976][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:47:39,977][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:47:39,979][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:47:41,006][__main__][INFO] - Iteration 436 took 23s (39.22% Gen, 56.38% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 47m 2s. Estimated total time: 19h 30m 54s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 1s, 500 more iterations: 3h 15m 9s.
[2025-11-13 10:47:41,008][__main__][INFO] - Starting iteration 436.
[2025-11-13 10:47:41,011][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 43 and human policies 1.
[2025-11-13 10:47:41,012][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:47:50,339][__main__][INFO] - Number of regex retries in iteration 436: 0
[2025-11-13 10:47:50,339][__main__][INFO] - agents played in iteration 436 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 10:47:50,774][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:47:50,808][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:47:50,841][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:47:50,875][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:47:50,875][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:47:50,876][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:47:51,617][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:47:51,912][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:47:52,238][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:47:52,566][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:47:52,897][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:47:53,225][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:47:53,553][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:47:53,882][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:47:54,211][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:47:54,537][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:47:54,870][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:47:55,199][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:47:55,528][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:47:55,859][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:47:56,188][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:47:56,517][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:47:56,847][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:47:57,177][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:47:57,504][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:47:57,831][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:47:58,160][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:47:58,489][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:47:58,816][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:47:59,143][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:47:59,470][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:47:59,798][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:48:00,125][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:48:00,452][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:48:00,778][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:48:01,104][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:48:01,431][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:48:01,759][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:48:02,087][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:48:02,817][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:48:03,559][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:48:03,561][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:48:03,563][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:48:04,531][__main__][INFO] - Iteration 437 took 23s (39.66% Gen, 56.22% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 51m 46s. Estimated total time: 19h 36m 2s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 12s, 500 more iterations: 3h 16m 0s.
[2025-11-13 10:48:04,533][__main__][INFO] - Starting iteration 437.
[2025-11-13 10:48:04,537][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 43 and human policies 1.
[2025-11-13 10:48:04,537][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:48:13,798][__main__][INFO] - Number of regex retries in iteration 437: 0
[2025-11-13 10:48:13,798][__main__][INFO] - agents played in iteration 437 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 10:48:14,238][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:48:14,276][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:48:14,310][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:48:14,344][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:48:14,344][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:48:14,344][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:48:15,061][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:48:15,357][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:48:15,684][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:48:16,009][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:48:16,335][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:48:16,661][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:48:16,987][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:48:17,312][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:48:17,639][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:48:17,967][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:48:18,297][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:48:18,624][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:48:18,951][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:48:19,277][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:48:19,604][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:48:19,932][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:48:20,262][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:48:20,594][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:48:20,922][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:48:21,250][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:48:21,577][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:48:21,904][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:48:22,231][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:48:22,558][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:48:22,885][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:48:23,212][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:48:23,539][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:48:23,865][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:48:24,192][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:48:24,518][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:48:24,846][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:48:25,173][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:48:25,502][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:48:26,231][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:48:26,967][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:48:26,969][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:48:26,970][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:48:28,094][__main__][INFO] - Iteration 438 took 23s (39.31% Gen, 55.92% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 53m 16s. Estimated total time: 19h 37m 55s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 15s, 500 more iterations: 3h 16m 19s.
[2025-11-13 10:48:28,096][__main__][INFO] - Starting iteration 438.
[2025-11-13 10:48:28,100][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 43 and human policies 1.
[2025-11-13 10:48:28,100][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:48:37,452][__main__][INFO] - Number of regex retries in iteration 438: 0
[2025-11-13 10:48:37,453][__main__][INFO] - agents played in iteration 438 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 10:48:37,891][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:48:37,926][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:48:37,959][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:48:37,993][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:48:37,994][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:48:37,994][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:48:38,702][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:48:38,997][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:48:39,324][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:48:39,650][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:48:39,977][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:48:40,303][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:48:40,632][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:48:40,958][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:48:41,284][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:48:41,615][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:48:41,946][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:48:42,274][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:48:42,601][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:48:42,928][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:48:43,260][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:48:43,590][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:48:43,922][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:48:44,256][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:48:44,583][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:48:44,911][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:48:45,238][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:48:45,567][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:48:45,893][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:48:46,220][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:48:46,548][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:48:46,874][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:48:47,202][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:48:47,528][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:48:47,856][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:48:48,182][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:48:48,511][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:48:48,838][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:48:49,166][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:48:49,895][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:48:50,634][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:48:50,636][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:48:50,638][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:48:51,646][__main__][INFO] - Iteration 439 took 23s (39.72% Gen, 55.99% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 52m 20s. Estimated total time: 19h 37m 23s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 14s, 500 more iterations: 3h 16m 13s.
[2025-11-13 10:48:51,649][__main__][INFO] - Starting iteration 439.
[2025-11-13 10:48:51,651][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 43 and human policies 1.
[2025-11-13 10:48:51,652][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:49:00,808][__main__][INFO] - Number of regex retries in iteration 439: 0
[2025-11-13 10:49:00,808][__main__][INFO] - agents played in iteration 439 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 10:49:01,246][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:49:01,280][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:49:01,314][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:49:01,348][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:49:01,348][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:49:01,349][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:49:02,044][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:49:02,340][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:49:02,669][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:49:02,998][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:49:03,325][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:49:03,656][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:49:03,984][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:49:04,314][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:49:04,646][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:49:04,973][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:49:05,300][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:49:05,633][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:49:05,961][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:49:06,290][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:49:06,625][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:49:06,959][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:49:07,288][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:49:07,617][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:49:07,945][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:49:08,272][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:49:08,598][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:49:08,926][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:49:09,253][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:49:09,580][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:49:09,907][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:49:10,234][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:49:10,561][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:49:10,888][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:49:11,215][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:49:11,541][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:49:11,868][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:49:12,196][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:49:12,525][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:49:13,279][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:49:14,025][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:49:14,026][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:49:14,028][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:49:15,016][__main__][INFO] - Iteration 440 took 23s (39.18% Gen, 56.58% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 42m 50s. Estimated total time: 19h 28m 16s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 56s, 500 more iterations: 3h 14m 42s.
[2025-11-13 10:49:15,018][__main__][INFO] - Starting iteration 440.
[2025-11-13 10:49:15,021][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 43 and human policies 1.
[2025-11-13 10:49:15,022][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:49:24,625][__main__][INFO] - Number of regex retries in iteration 440: 0
[2025-11-13 10:49:24,626][__main__][INFO] - agents played in iteration 440 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 10:49:25,053][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:49:25,086][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:49:25,119][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:49:25,153][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:49:25,154][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:49:25,154][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:49:25,877][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:49:26,174][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:49:26,505][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:49:26,839][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:49:27,166][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:49:27,498][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:49:27,830][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:49:28,158][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:49:28,486][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:49:28,814][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:49:29,144][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:49:29,477][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:49:29,810][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:49:30,142][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:49:30,475][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:49:30,806][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:49:31,135][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:49:31,463][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:49:31,790][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:49:32,117][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:49:32,444][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:49:32,771][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:49:33,098][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:49:33,424][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:49:33,751][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:49:34,079][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:49:34,408][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:49:34,734][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:49:35,062][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:49:35,389][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:49:35,717][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:49:36,043][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:49:36,371][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:49:37,092][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:49:37,818][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:49:37,822][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:49:37,823][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:49:39,959][__main__][INFO] - Iteration 441 took 24s (38.51% Gen, 52.92% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 1m 4s. Estimated total time: 20h 46m 54s. Time estimates for 10 more iterations: 4m 9s, 100 more iterations: 41m 33s, 500 more iterations: 3h 27m 49s.
[2025-11-13 10:49:39,961][__main__][INFO] - Starting iteration 441.
[2025-11-13 10:49:39,964][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 44 and human policies 1.
[2025-11-13 10:49:39,964][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:49:49,560][__main__][INFO] - Number of regex retries in iteration 441: 0
[2025-11-13 10:49:49,560][__main__][INFO] - agents played in iteration 441 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 10:49:49,999][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:49:50,034][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:49:50,067][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:49:50,101][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:49:50,102][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:49:50,102][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:49:50,825][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:49:51,122][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:49:51,448][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:49:51,774][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:49:52,102][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:49:52,429][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:49:52,755][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:49:53,082][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:49:53,408][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:49:53,735][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:49:54,062][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:49:54,389][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:49:54,718][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:49:55,045][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:49:55,372][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:49:55,699][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:49:56,026][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:49:56,353][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:49:56,680][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:49:57,007][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:49:57,334][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:49:57,660][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:49:57,988][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:49:58,315][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:49:58,642][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:49:58,968][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:49:59,295][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:49:59,623][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:49:59,951][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:50:00,277][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:50:00,603][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:50:00,931][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:50:01,258][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:50:02,002][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:50:02,730][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:50:02,731][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:50:02,733][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:50:03,761][__main__][INFO] - Iteration 442 took 23s (40.32% Gen, 55.35% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 3m 40s. Estimated total time: 19h 49m 54s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 39s, 500 more iterations: 3h 18m 19s.
[2025-11-13 10:50:03,763][__main__][INFO] - Starting iteration 442.
[2025-11-13 10:50:03,766][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 44 and human policies 1.
[2025-11-13 10:50:03,767][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:50:12,572][__main__][INFO] - Number of regex retries in iteration 442: 0
[2025-11-13 10:50:12,573][__main__][INFO] - agents played in iteration 442 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 10:50:13,020][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:50:13,053][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:50:13,086][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:50:13,120][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:50:13,120][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:50:13,121][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:50:13,864][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:50:14,161][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:50:14,490][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:50:14,817][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:50:15,145][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:50:15,471][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:50:15,798][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:50:16,127][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:50:16,454][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:50:16,780][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:50:17,111][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:50:17,440][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:50:17,768][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:50:18,095][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:50:18,425][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:50:18,753][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:50:19,081][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:50:19,407][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:50:19,734][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:50:20,061][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:50:20,388][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:50:20,715][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:50:21,041][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:50:21,368][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:50:21,696][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:50:22,023][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:50:22,350][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:50:22,678][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:50:23,004][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:50:23,331][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:50:23,658][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:50:23,985][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:50:24,315][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:50:25,053][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:50:25,780][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:50:25,787][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:50:25,789][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:50:26,818][__main__][INFO] - Iteration 443 took 23s (38.20% Gen, 57.33% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 25m 59s. Estimated total time: 19h 12m 37s. Time estimates for 10 more iterations: 3m 50s, 100 more iterations: 38m 25s, 500 more iterations: 3h 12m 6s.
[2025-11-13 10:50:26,820][__main__][INFO] - Starting iteration 443.
[2025-11-13 10:50:26,823][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 44 and human policies 1.
[2025-11-13 10:50:26,824][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:50:36,605][__main__][INFO] - Number of regex retries in iteration 443: 0
[2025-11-13 10:50:36,606][__main__][INFO] - agents played in iteration 443 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 10:50:37,051][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:50:37,088][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:50:37,122][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:50:37,156][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:50:37,157][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:50:37,157][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:50:37,891][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:50:38,188][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:50:38,514][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:50:38,840][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:50:39,167][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:50:39,493][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:50:39,819][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:50:40,146][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:50:40,474][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:50:40,801][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:50:41,130][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:50:41,457][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:50:41,784][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:50:42,114][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:50:42,442][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:50:42,769][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:50:43,096][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:50:43,424][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:50:43,751][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:50:44,077][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:50:44,404][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:50:44,730][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:50:45,057][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:50:45,384][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:50:45,711][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:50:46,038][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:50:46,365][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:50:46,691][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:50:47,017][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:50:47,344][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:50:47,674][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:50:48,003][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:50:48,331][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:50:49,067][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:50:49,821][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:50:49,822][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:50:49,824][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:50:50,974][__main__][INFO] - Iteration 444 took 24s (40.50% Gen, 54.73% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 20m 31s. Estimated total time: 20h 7m 33s. Time estimates for 10 more iterations: 4m 1s, 100 more iterations: 40m 15s, 500 more iterations: 3h 21m 15s.
[2025-11-13 10:50:50,976][__main__][INFO] - Starting iteration 444.
[2025-11-13 10:50:50,978][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 44 and human policies 1.
[2025-11-13 10:50:50,979][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:51:00,520][__main__][INFO] - Number of regex retries in iteration 444: 0
[2025-11-13 10:51:00,521][__main__][INFO] - agents played in iteration 444 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 10:51:00,957][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:51:00,992][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:51:01,026][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:51:01,060][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:51:01,061][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:51:01,061][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:51:01,759][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:51:02,054][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:51:02,382][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:51:02,709][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:51:03,038][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:51:03,367][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:51:03,694][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:51:04,021][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:51:04,348][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:51:04,678][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:51:05,007][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:51:05,334][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:51:05,665][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:51:05,993][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:51:06,319][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:51:06,646][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:51:06,972][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:51:07,298][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:51:07,624][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:51:07,950][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:51:08,278][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:51:08,604][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:51:08,931][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:51:09,261][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:51:09,589][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:51:09,916][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:51:10,244][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:51:10,572][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:51:10,899][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:51:11,226][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:51:11,552][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:51:11,879][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:51:12,206][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:51:12,962][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:51:13,703][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:51:13,704][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:51:13,706][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:51:14,732][__main__][INFO] - Iteration 445 took 23s (40.17% Gen, 55.51% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 0m 19s. Estimated total time: 19h 47m 44s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 35s, 500 more iterations: 3h 17m 57s.
[2025-11-13 10:51:14,734][__main__][INFO] - Starting iteration 445.
[2025-11-13 10:51:14,737][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 44 and human policies 1.
[2025-11-13 10:51:14,737][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:51:23,931][__main__][INFO] - Number of regex retries in iteration 445: 0 [2025-11-13 10:51:23,931][__main__][INFO] - agents played in iteration 445 are Alice_buffer, Alice, Bob, Bob_buffer [2025-11-13 10:51:24,371][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:51:24,404][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:51:24,438][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:51:24,473][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:51:24,473][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:51:24,474][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:51:25,167][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:51:25,463][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:51:25,788][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:51:26,114][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:51:26,441][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:51:26,770][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:51:27,103][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:51:27,430][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:51:27,757][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:51:28,084][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:51:28,411][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:51:28,738][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:51:29,066][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:51:29,394][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:51:29,722][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:51:30,049][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:51:30,378][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:51:30,705][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:51:31,032][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:51:31,359][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:51:31,688][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 10:51:32,014][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:51:32,341][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:51:32,668][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:51:32,996][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:51:33,323][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:51:33,651][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:51:33,977][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:51:34,306][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:51:34,633][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:51:34,960][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:51:35,287][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:51:35,614][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 10:51:36,374][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:51:37,106][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:51:37,107][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:51:37,109][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:51:38,047][__main__][INFO] - Iteration 446 took 23s (39.44% Gen, 56.53% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 37m 44s. Estimated total time: 19h 25m 33s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 51s, 500 more iterations: 3h 14m 15s. [2025-11-13 10:51:38,049][__main__][INFO] - Starting iteration 446. [2025-11-13 10:51:38,052][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 44 and human policies 1. 
[2025-11-13 10:51:38,053][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:51:47,133][__main__][INFO] - Number of regex retries in iteration 446: 0 [2025-11-13 10:51:47,134][__main__][INFO] - agents played in iteration 446 are Alice_buffer, Alice, Bob, Bob_buffer [2025-11-13 10:51:47,568][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:51:47,942][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:51:47,977][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:51:48,010][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:51:48,011][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:51:48,011][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:51:48,734][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:51:49,029][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:51:49,355][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:51:49,684][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:51:50,016][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:51:50,349][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:51:50,676][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:51:51,003][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:51:51,331][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:51:51,662][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:51:51,992][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:51:52,321][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:51:52,651][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:51:52,977][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:51:53,304][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:51:53,630][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:51:53,957][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:51:54,284][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:51:54,610][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:51:54,937][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:51:55,263][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 10:51:55,590][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:51:55,916][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:51:56,242][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:51:56,568][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:51:56,894][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:51:57,220][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:51:57,547][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:51:57,873][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:51:58,199][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:51:58,525][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:51:58,853][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:51:59,183][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 10:51:59,909][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:52:00,631][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:52:00,633][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:52:00,635][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:52:01,561][__main__][INFO] - Iteration 447 took 23s (38.63% Gen, 57.42% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 47m 16s. Estimated total time: 19h 35m 29s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 10s, 500 more iterations: 3h 15m 54s. [2025-11-13 10:52:01,563][__main__][INFO] - Starting iteration 447. [2025-11-13 10:52:01,566][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 44 and human policies 1. 
[2025-11-13 10:52:01,567][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:52:10,781][__main__][INFO] - Number of regex retries in iteration 447: 0 [2025-11-13 10:52:10,781][__main__][INFO] - agents played in iteration 447 are Alice_buffer, Alice, Bob, Bob_buffer [2025-11-13 10:52:11,221][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:52:11,255][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:52:11,289][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:52:11,323][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:52:11,324][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:52:11,325][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:52:12,015][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:52:12,310][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:52:12,638][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:52:12,964][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:52:13,292][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:52:13,619][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:52:13,946][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:52:14,272][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:52:14,599][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:52:14,930][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:52:15,258][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:52:15,584][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:52:15,911][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:52:16,240][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:52:16,567][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:52:16,893][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:52:17,218][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:52:17,544][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:52:17,871][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:52:18,198][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:52:18,524][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 10:52:18,851][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:52:19,177][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:52:19,504][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:52:19,832][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:52:20,159][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:52:20,486][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:52:20,813][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:52:21,139][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:52:21,465][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:52:21,795][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:52:22,123][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:52:22,451][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 10:52:23,182][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:52:23,904][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:52:23,905][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:52:23,907][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:52:24,957][__main__][INFO] - Iteration 448 took 23s (39.39% Gen, 56.11% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 40m 58s. Estimated total time: 19h 29m 34s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 59s, 500 more iterations: 3h 14m 55s. [2025-11-13 10:52:24,959][__main__][INFO] - Starting iteration 448. [2025-11-13 10:52:24,962][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 44 and human policies 1. 
[2025-11-13 10:52:24,962][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:52:34,078][__main__][INFO] - Number of regex retries in iteration 448: 0 [2025-11-13 10:52:34,078][__main__][INFO] - agents played in iteration 448 are Alice_buffer, Alice, Bob, Bob_buffer [2025-11-13 10:52:34,537][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:52:34,570][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:52:34,604][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:52:34,639][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:52:34,639][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:52:34,639][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:52:35,355][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:52:35,653][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:52:35,979][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:52:36,306][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:52:36,637][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:52:36,964][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:52:37,291][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:52:37,620][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:52:37,950][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:52:38,278][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:52:38,607][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:52:38,936][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:52:39,264][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:52:39,589][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:52:39,916][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:52:40,242][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:52:40,568][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:52:40,894][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:52:41,221][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:52:41,549][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:52:41,877][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 10:52:42,204][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:52:42,531][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:52:42,858][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:52:43,185][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:52:43,512][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:52:43,840][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:52:44,167][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:52:44,494][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:52:44,822][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:52:45,149][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:52:45,476][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:52:45,805][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 10:52:46,541][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:52:47,264][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:52:47,266][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:52:47,267][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:52:48,261][__main__][INFO] - Iteration 449 took 23s (39.12% Gen, 56.61% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 36m 1s. Estimated total time: 19h 25m 0s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 50s, 500 more iterations: 3h 14m 10s. [2025-11-13 10:52:48,264][__main__][INFO] - Starting iteration 449. [2025-11-13 10:52:48,267][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 44 and human policies 1. 
[2025-11-13 10:52:48,267][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:52:57,428][__main__][INFO] - Number of regex retries in iteration 449: 0 [2025-11-13 10:52:57,429][__main__][INFO] - agents played in iteration 449 are Alice_buffer, Alice, Bob, Bob_buffer [2025-11-13 10:52:57,863][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:52:57,899][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:52:57,932][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:52:57,966][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:52:57,967][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:52:57,967][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:52:58,690][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:52:58,986][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:52:59,312][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:52:59,639][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:52:59,968][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:53:00,294][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:53:00,622][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:53:00,950][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:53:01,279][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:53:01,607][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:53:01,940][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:53:02,266][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:53:02,593][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:53:02,922][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:53:03,251][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:53:03,578][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:53:03,905][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:53:04,233][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:53:04,561][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:53:04,888][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:53:05,215][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 10:53:05,544][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:53:05,870][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:53:06,197][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:53:06,525][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:53:06,852][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:53:07,179][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:53:07,506][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:53:07,833][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:53:08,161][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:53:08,487][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:53:08,816][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:53:09,142][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 10:53:09,885][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:53:10,630][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:53:10,631][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:53:10,633][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:53:11,691][__main__][INFO] - Iteration 450 took 23s (39.11% Gen, 56.37% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 41m 52s. Estimated total time: 19h 31m 15s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 2s, 500 more iterations: 3h 15m 12s. [2025-11-13 10:53:11,693][__main__][INFO] - Starting iteration 450. [2025-11-13 10:53:11,696][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 44 and human policies 1. 
[2025-11-13 10:53:11,697][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:53:21,185][__main__][INFO] - Number of regex retries in iteration 450: 0 [2025-11-13 10:53:21,186][__main__][INFO] - agents played in iteration 450 are Alice_buffer, Alice, Bob, Bob_buffer [2025-11-13 10:53:21,628][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:53:21,663][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:53:21,697][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:53:21,731][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:53:21,732][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:53:21,732][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:53:22,464][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:53:22,760][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:53:23,087][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:53:23,413][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:53:23,739][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:53:24,065][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:53:24,393][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:53:24,722][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:53:25,049][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:53:25,376][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:53:25,705][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:53:26,032][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:53:26,360][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:53:26,689][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:53:27,017][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:53:27,343][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:53:27,670][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:53:27,996][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:53:28,324][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:53:28,651][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:53:28,978][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 10:53:29,306][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:53:29,634][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:53:29,963][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:53:30,289][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:53:30,618][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:53:30,946][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:53:31,274][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:53:31,601][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:53:31,928][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:53:32,255][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:53:32,581][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:53:32,908][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 10:53:33,622][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:53:34,346][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:53:34,348][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:53:34,349][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:53:36,319][__main__][INFO] - Iteration 451 took 24s (38.54% Gen, 53.46% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 41m 24s. Estimated total time: 20h 31m 12s. Time estimates for 10 more iterations: 4m 6s, 100 more iterations: 41m 2s, 500 more iterations: 3h 25m 12s. [2025-11-13 10:53:36,322][__main__][INFO] - Starting iteration 451. [2025-11-13 10:53:36,325][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 45 and human policies 1. 
[2025-11-13 10:53:36,326][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:53:45,839][__main__][INFO] - Number of regex retries in iteration 451: 0 [2025-11-13 10:53:45,840][__main__][INFO] - agents played in iteration 451 are Alice_buffer, Alice, Bob, Bob_buffer [2025-11-13 10:53:46,278][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:53:46,312][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:53:46,345][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:53:46,379][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:53:46,379][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:53:46,380][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:53:47,098][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:53:47,394][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:53:47,719][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:53:48,045][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:53:48,372][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:53:48,699][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:53:49,026][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:53:49,353][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:53:49,680][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:53:50,006][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:53:50,334][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:53:50,661][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:53:50,989][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:53:51,315][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:53:51,642][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:53:51,968][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:53:52,295][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:53:52,623][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:53:52,951][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:53:53,277][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:53:53,604][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 10:53:53,930][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:53:54,258][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:53:54,585][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:53:54,912][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:53:55,239][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:53:55,566][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:53:55,893][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:53:56,220][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:53:56,546][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:53:56,874][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:53:57,203][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:53:57,530][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 10:53:58,257][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:53:58,990][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:53:58,991][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:53:58,993][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:53:59,959][__main__][INFO] - Iteration 452 took 23s (40.26% Gen, 55.65% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 51m 33s. Estimated total time: 19h 41m 43s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 23s, 500 more iterations: 3h 16m 57s. [2025-11-13 10:53:59,961][__main__][INFO] - Starting iteration 452. [2025-11-13 10:53:59,965][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 45 and human policies 1. 
[2025-11-13 10:53:59,966][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:54:09,596][__main__][INFO] - Number of regex retries in iteration 452: 0 [2025-11-13 10:54:09,597][__main__][INFO] - agents played in iteration 452 are Alice_buffer, Alice, Bob, Bob_buffer [2025-11-13 10:54:10,064][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:54:10,097][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:54:10,130][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:54:10,163][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:54:10,164][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:54:10,165][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:54:10,869][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:54:11,167][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:54:11,494][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:54:11,822][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:54:12,150][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:54:12,482][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:54:12,810][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:54:13,140][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:54:13,473][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:54:13,802][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:54:14,134][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:54:14,461][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:54:14,789][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:54:15,115][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:54:15,442][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:54:15,769][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:54:16,096][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:54:16,423][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:54:16,749][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:54:17,076][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:54:17,402][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 10:54:17,730][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:54:18,056][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:54:18,384][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:54:18,711][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:54:19,037][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:54:19,366][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:54:19,696][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:54:20,023][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:54:20,351][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:54:20,679][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:54:21,005][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:54:21,333][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 10:54:22,069][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:54:22,793][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:54:22,794][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:54:22,796][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:54:23,702][__main__][INFO] - Iteration 453 took 23s (40.57% Gen, 55.60% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 56m 21s. Estimated total time: 19h 46m 55s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 33s, 500 more iterations: 3h 17m 49s. [2025-11-13 10:54:23,705][__main__][INFO] - Starting iteration 453. [2025-11-13 10:54:23,707][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 45 and human policies 1. 
[2025-11-13 10:54:23,708][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:54:32,920][__main__][INFO] - Number of regex retries in iteration 453: 0 [2025-11-13 10:54:32,920][__main__][INFO] - agents played in iteration 453 are Alice_buffer, Alice, Bob, Bob_buffer [2025-11-13 10:54:33,358][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:54:33,391][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:54:33,423][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:54:33,457][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:54:33,457][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:54:33,457][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:54:34,168][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:54:34,463][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:54:34,790][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:54:35,122][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:54:35,455][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:54:35,783][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:54:36,115][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:54:36,442][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:54:36,773][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:54:37,101][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:54:37,430][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:54:37,762][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:54:38,089][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:54:38,416][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:54:38,743][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:54:39,070][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:54:39,397][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:54:39,724][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:54:40,050][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:54:40,376][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:54:40,703][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 10:54:41,031][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:54:41,358][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:54:41,685][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:54:42,012][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:54:42,339][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:54:42,666][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:54:42,993][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:54:43,320][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:54:43,646][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:54:43,975][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:54:44,303][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:54:44,631][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 10:54:45,377][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:54:46,105][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:54:46,107][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:54:46,108][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:54:47,094][__main__][INFO] - Iteration 454 took 23s (39.39% Gen, 56.39% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 38m 24s. Estimated total time: 19h 29m 21s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 58s, 500 more iterations: 3h 14m 53s. [2025-11-13 10:54:47,096][__main__][INFO] - Starting iteration 454. [2025-11-13 10:54:47,098][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 45 and human policies 1. 
[2025-11-13 10:54:47,099][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:54:56,490][__main__][INFO] - Number of regex retries in iteration 454: 0 [2025-11-13 10:54:56,491][__main__][INFO] - agents played in iteration 454 are Alice_buffer, Alice, Bob, Bob_buffer [2025-11-13 10:54:56,923][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:54:56,956][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:54:56,989][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:54:57,022][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:54:57,023][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:54:57,024][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:54:57,721][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:54:58,016][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:54:58,344][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:54:58,673][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:54:58,999][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:54:59,327][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:54:59,656][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:54:59,987][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:55:00,315][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:55:00,642][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:55:00,973][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:55:01,300][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:55:01,628][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:55:01,955][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:55:02,282][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:55:02,609][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:55:02,936][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:55:03,262][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:55:03,589][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:55:03,917][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:55:04,244][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 10:55:04,572][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:55:04,899][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:55:05,226][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:55:05,554][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:55:05,881][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:55:06,209][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:55:06,537][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:55:06,866][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:55:07,194][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:55:07,522][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:55:07,852][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:55:08,178][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 10:55:08,899][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:55:09,620][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:55:09,669][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:55:09,671][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:55:10,799][__main__][INFO] - Iteration 455 took 23s (39.62% Gen, 55.61% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 53m 43s. Estimated total time: 19h 45m 5s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 30s, 500 more iterations: 3h 17m 30s. [2025-11-13 10:55:10,801][__main__][INFO] - Starting iteration 455. [2025-11-13 10:55:10,804][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 45 and human policies 1. 
[2025-11-13 10:55:10,805][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:55:19,332][__main__][INFO] - Number of regex retries in iteration 455: 0 [2025-11-13 10:55:19,333][__main__][INFO] - agents played in iteration 455 are Alice_buffer, Alice, Bob, Bob_buffer [2025-11-13 10:55:19,768][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:55:19,803][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:55:19,837][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:55:19,871][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:55:19,872][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:55:19,872][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:55:20,566][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:55:20,863][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:55:21,191][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:55:21,518][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:55:21,845][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:55:22,172][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:55:22,501][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:55:22,828][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:55:23,157][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:55:23,485][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:55:23,814][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:55:24,140][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:55:24,467][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:55:24,795][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:55:25,127][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:55:25,456][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:55:25,784][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:55:26,111][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:55:26,437][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:55:26,764][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:55:27,091][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 10:55:27,419][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:55:27,747][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:55:28,074][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:55:28,402][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:55:28,728][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:55:29,055][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:55:29,382][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:55:29,709][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:55:30,036][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:55:30,363][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:55:30,694][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:55:31,027][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 10:55:31,758][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:55:32,492][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:55:32,494][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:55:32,495][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:55:33,551][__main__][INFO] - Iteration 456 took 22s (37.49% Gen, 57.86% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 5m 40s. Estimated total time: 18h 57m 24s. Time estimates for 10 more iterations: 3m 47s, 100 more iterations: 37m 54s, 500 more iterations: 3h 9m 34s. [2025-11-13 10:55:33,553][__main__][INFO] - Starting iteration 456. [2025-11-13 10:55:33,556][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 45 and human policies 1. 
[2025-11-13 10:55:33,557][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:55:42,512][__main__][INFO] - Number of regex retries in iteration 456: 0
[2025-11-13 10:55:42,513][__main__][INFO] - agents played in iteration 456 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 10:55:42,956][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:55:42,990][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:55:43,023][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:55:43,056][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:55:43,057][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:55:43,057][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:55:43,794][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:55:44,093][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:55:44,421][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:55:44,749][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:55:45,080][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:55:45,410][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:55:45,737][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:55:46,068][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:55:46,396][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:55:46,723][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:55:47,049][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:55:47,379][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:55:47,712][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:55:48,040][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:55:48,367][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:55:48,694][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:55:49,021][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:55:49,348][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:55:49,676][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:55:50,002][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:55:50,327][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:55:50,654][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:55:50,981][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:55:51,308][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:55:51,635][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:55:51,963][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:55:52,290][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:55:52,616][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:55:52,943][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:55:53,271][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:55:53,599][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:55:53,931][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:55:54,260][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:55:55,013][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:55:55,747][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:55:55,749][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:55:55,750][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:55:56,785][__main__][INFO] - Iteration 457 took 23s (38.55% Gen, 56.98% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 29m 22s. Estimated total time: 19h 21m 30s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 43s, 500 more iterations: 3h 13m 35s.
[2025-11-13 10:55:56,787][__main__][INFO] - Starting iteration 457.
[2025-11-13 10:55:56,791][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 45 and human policies 1.
[2025-11-13 10:55:56,791][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:56:05,892][__main__][INFO] - Number of regex retries in iteration 457: 0
[2025-11-13 10:56:05,893][__main__][INFO] - agents played in iteration 457 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 10:56:06,332][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:56:06,370][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:56:06,750][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:56:06,786][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:56:06,787][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:56:06,787][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:56:07,486][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:56:07,782][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:56:08,110][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:56:08,435][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:56:08,762][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:56:09,092][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:56:09,423][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:56:09,752][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:56:10,078][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:56:10,408][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:56:10,737][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:56:11,067][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:56:11,395][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:56:11,722][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:56:12,049][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:56:12,377][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:56:12,705][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:56:13,033][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:56:13,360][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:56:13,687][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:56:14,014][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:56:14,342][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:56:14,669][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:56:14,996][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:56:15,323][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:56:15,651][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:56:15,978][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:56:16,304][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:56:16,633][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:56:16,960][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:56:17,287][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:56:17,616][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:56:17,945][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:56:18,676][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:56:19,408][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:56:19,409][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:56:19,411][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:56:20,392][__main__][INFO] - Iteration 458 took 23s (38.56% Gen, 57.28% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 47m 36s. Estimated total time: 19h 40m 7s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 20s, 500 more iterations: 3h 16m 41s.
[2025-11-13 10:56:20,394][__main__][INFO] - Starting iteration 458.
[2025-11-13 10:56:20,397][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 45 and human policies 1.
[2025-11-13 10:56:20,398][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:56:29,911][__main__][INFO] - Number of regex retries in iteration 458: 0
[2025-11-13 10:56:29,912][__main__][INFO] - agents played in iteration 458 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 10:56:30,338][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:56:30,371][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:56:30,405][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:56:30,439][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:56:30,439][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:56:30,440][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:56:31,135][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:56:31,432][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:56:31,760][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:56:32,089][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:56:32,415][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:56:32,742][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:56:33,069][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:56:33,396][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:56:33,724][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:56:34,052][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:56:34,383][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:56:34,711][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:56:35,040][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:56:35,368][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:56:35,695][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:56:36,021][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:56:36,347][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:56:36,673][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:56:37,000][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:56:37,328][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:56:37,654][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:56:37,981][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:56:38,308][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:56:38,635][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:56:38,962][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:56:39,288][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:56:39,615][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:56:39,942][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:56:40,269][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:56:40,598][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:56:40,927][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:56:41,253][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:56:41,582][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:56:42,333][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:56:43,045][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:56:43,047][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:56:43,048][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:56:44,006][__main__][INFO] - Iteration 459 took 23s (40.30% Gen, 55.64% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 47m 33s. Estimated total time: 19h 40m 28s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 20s, 500 more iterations: 3h 16m 44s.
[2025-11-13 10:56:44,008][__main__][INFO] - Starting iteration 459.
[2025-11-13 10:56:44,011][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 45 and human policies 1.
[2025-11-13 10:56:44,011][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:56:53,303][__main__][INFO] - Number of regex retries in iteration 459: 0
[2025-11-13 10:56:53,304][__main__][INFO] - agents played in iteration 459 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 10:56:53,732][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:56:53,765][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:56:53,798][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:56:53,831][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:56:53,832][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:56:53,832][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:56:54,521][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:56:54,816][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:56:55,143][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:56:55,469][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:56:55,795][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:56:56,123][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:56:56,450][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:56:56,779][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:56:57,105][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:56:57,431][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:56:57,758][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:56:58,087][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:56:58,416][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:56:58,743][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:56:59,070][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:56:59,398][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:56:59,726][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:57:00,054][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:57:00,381][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:57:00,707][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:57:01,035][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:57:01,362][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:57:01,689][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:57:02,016][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:57:02,344][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:57:02,671][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:57:02,998][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:57:03,325][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:57:03,652][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:57:03,979][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:57:04,307][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:57:04,637][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:57:04,964][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:57:05,698][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:57:06,429][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:57:06,431][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:57:06,433][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:57:07,360][__main__][INFO] - Iteration 460 took 23s (39.79% Gen, 56.23% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 34m 13s. Estimated total time: 19h 27m 31s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 55s, 500 more iterations: 3h 14m 35s.
[2025-11-13 10:57:07,362][__main__][INFO] - Starting iteration 460.
[2025-11-13 10:57:07,365][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 45 and human policies 1.
[2025-11-13 10:57:07,366][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:57:16,772][__main__][INFO] - Number of regex retries in iteration 460: 0
[2025-11-13 10:57:16,773][__main__][INFO] - agents played in iteration 460 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 10:57:17,211][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:57:17,244][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:57:17,278][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:57:17,312][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:57:17,313][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:57:17,313][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:57:18,012][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:57:18,308][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:57:18,637][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:57:18,966][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:57:19,296][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:57:19,626][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:57:19,955][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:57:20,288][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:57:20,616][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:57:20,946][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:57:21,274][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:57:21,601][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:57:21,930][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:57:22,259][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:57:22,593][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:57:22,921][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:57:23,249][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:57:23,575][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:57:23,901][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:57:24,228][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:57:24,554][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:57:24,882][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:57:25,209][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:57:25,536][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:57:25,862][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:57:26,189][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:57:26,516][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:57:26,842][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:57:27,169][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:57:27,496][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:57:27,824][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:57:28,151][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:57:28,479][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:57:29,239][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:57:29,966][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:57:29,968][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:57:29,969][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:57:31,854][__main__][INFO] - Iteration 461 took 24s (38.41% Gen, 53.89% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 30m 47s. Estimated total time: 20h 24m 30s. Time estimates for 10 more iterations: 4m 4s, 100 more iterations: 40m 49s, 500 more iterations: 3h 24m 5s.
[2025-11-13 10:57:31,857][__main__][INFO] - Starting iteration 461.
[2025-11-13 10:57:31,861][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 46 and human policies 1.
[2025-11-13 10:57:31,861][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:57:40,639][__main__][INFO] - Number of regex retries in iteration 461: 0
[2025-11-13 10:57:40,640][__main__][INFO] - agents played in iteration 461 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 10:57:41,076][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:57:41,109][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:57:41,142][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:57:41,175][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:57:41,176][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:57:41,176][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:57:41,870][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:57:42,165][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:57:42,493][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:57:42,822][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:57:43,148][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:57:43,475][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:57:43,804][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:57:44,131][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:57:44,465][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:57:44,797][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:57:45,129][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:57:45,456][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:57:45,785][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:57:46,112][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:57:46,444][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:57:46,772][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:57:47,098][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:57:47,425][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:57:47,753][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:57:48,080][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:57:48,407][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:57:48,734][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:57:49,061][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:57:49,389][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:57:49,715][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:57:50,041][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:57:50,368][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:57:50,695][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:57:51,022][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:57:51,350][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:57:51,678][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:57:52,006][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:57:52,332][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:57:53,063][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:57:53,786][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:57:53,788][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:57:53,789][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:57:54,832][__main__][INFO] - Iteration 462 took 22s (38.21% Gen, 57.24% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 14m 30s. Estimated total time: 19h 8m 35s. Time estimates for 10 more iterations: 3m 49s, 100 more iterations: 38m 17s, 500 more iterations: 3h 11m 25s.
[2025-11-13 10:57:54,834][__main__][INFO] - Starting iteration 462.
[2025-11-13 10:57:54,836][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 46 and human policies 1.
[2025-11-13 10:57:54,837][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:58:04,180][__main__][INFO] - Number of regex retries in iteration 462: 0
[2025-11-13 10:58:04,180][__main__][INFO] - agents played in iteration 462 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 10:58:04,604][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:58:04,637][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:58:04,670][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:58:04,704][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:58:04,704][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:58:04,704][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:58:05,396][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:58:05,692][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:58:06,020][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:58:06,346][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:58:06,674][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:58:07,001][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:58:07,329][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:58:07,660][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:58:07,988][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:58:08,317][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:58:08,647][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:58:08,975][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:58:09,306][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:58:09,638][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:58:09,971][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:58:10,298][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:58:10,626][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:58:10,954][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:58:11,280][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:58:11,608][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:58:11,935][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:58:12,264][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:58:12,591][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:58:12,918][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:58:13,243][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:58:13,569][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:58:13,897][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:58:14,223][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:58:14,551][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:58:14,878][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:58:15,205][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:58:15,532][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:58:15,860][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:58:16,589][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:58:17,281][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:58:17,283][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:58:17,284][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:58:18,215][__main__][INFO] - Iteration 463 took 23s (39.96% Gen, 56.05% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 34m 28s. Estimated total time: 19h 28m 57s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 57s, 500 more iterations: 3h 14m 49s.
[2025-11-13 10:58:18,217][__main__][INFO] - Starting iteration 463.
[2025-11-13 10:58:18,220][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 46 and human policies 1.
[2025-11-13 10:58:18,220][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:58:27,020][__main__][INFO] - Number of regex retries in iteration 463: 0
[2025-11-13 10:58:27,021][__main__][INFO] - agents played in iteration 463 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 10:58:27,460][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:58:27,494][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:58:27,527][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:58:27,560][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:58:27,560][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:58:27,561][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:58:28,250][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:58:28,547][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:58:28,875][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:58:29,200][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:58:29,524][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:58:29,850][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:58:30,175][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:58:30,503][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:58:30,829][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:58:31,154][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:58:31,482][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:58:31,809][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:58:32,138][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:58:32,469][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:58:32,798][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:58:33,124][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:58:33,455][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:58:33,786][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:58:34,115][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:58:34,442][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:58:34,768][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:58:35,095][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:58:35,423][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:58:35,749][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:58:36,077][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:58:36,404][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:58:36,731][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:58:37,058][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:58:37,385][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:58:37,711][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:58:38,038][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:58:38,366][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:58:38,694][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:58:39,441][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:58:40,173][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:58:40,175][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:58:40,177][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:58:41,093][__main__][INFO] - Iteration 464 took 22s (38.47% Gen, 57.52% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 8m 51s. Estimated total time: 19h 3m 43s. Time estimates for 10 more iterations: 3m 48s, 100 more iterations: 38m 7s, 500 more iterations: 3h 10m 37s.
[2025-11-13 10:58:41,095][__main__][INFO] - Starting iteration 464.
[2025-11-13 10:58:41,098][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 46 and human policies 1.
[2025-11-13 10:58:41,098][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:58:50,733][__main__][INFO] - Number of regex retries in iteration 464: 0
[2025-11-13 10:58:50,734][__main__][INFO] - agents played in iteration 464 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 10:58:51,168][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:58:51,203][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:58:51,237][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:58:51,270][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:58:51,270][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:58:51,271][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:58:51,959][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:58:52,256][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:58:52,581][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:58:52,910][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:58:53,239][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:58:53,565][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:58:53,891][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:58:54,218][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:58:54,546][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:58:54,874][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:58:55,201][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:58:55,529][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:58:55,855][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:58:56,183][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:58:56,509][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:58:56,840][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:58:57,167][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:58:57,495][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:58:57,822][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:58:58,149][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:58:58,476][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:58:58,802][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:58:59,131][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:58:59,458][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:58:59,785][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:59:00,112][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:59:00,438][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:59:00,766][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:59:01,092][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:59:01,420][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:59:01,747][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:59:02,076][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:59:02,403][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:59:03,143][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:59:03,865][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:59:03,867][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:59:03,868][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:59:04,806][__main__][INFO] - Iteration 465 took 23s (40.64% Gen, 55.40% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 50m 12s. Estimated total time: 19h 45m 28s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 30s, 500 more iterations: 3h 17m 34s.
[2025-11-13 10:59:04,809][__main__][INFO] - Starting iteration 465.
[2025-11-13 10:59:04,811][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 46 and human policies 1.
[2025-11-13 10:59:04,812][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:59:14,338][__main__][INFO] - Number of regex retries in iteration 465: 0
[2025-11-13 10:59:14,338][__main__][INFO] - agents played in iteration 465 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 10:59:14,761][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:59:14,797][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:59:14,830][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:59:14,864][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:59:14,865][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:59:14,865][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:59:15,558][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:59:15,854][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:59:16,181][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:59:16,509][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:59:16,836][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:59:17,161][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:59:17,487][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:59:17,816][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:59:18,145][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:59:18,473][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:59:18,799][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:59:19,131][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:59:19,457][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:59:19,785][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:59:20,114][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:59:20,445][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:59:20,772][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:59:21,099][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:59:21,426][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:59:21,753][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:59:22,080][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:59:22,406][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:59:22,733][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:59:23,061][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:59:23,387][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:59:23,713][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:59:24,041][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:59:24,368][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:59:24,696][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:59:25,022][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:59:25,349][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:59:25,676][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:59:26,002][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:59:26,728][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:59:27,457][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:59:27,458][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:59:27,460][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:59:28,487][__main__][INFO] - Iteration 466 took 23s (40.24% Gen, 55.42% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 48m 10s. Estimated total time: 19h 43m 49s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 27s, 500 more iterations: 3h 17m 18s.
[2025-11-13 10:59:28,489][__main__][INFO] - Starting iteration 466.
[2025-11-13 10:59:28,492][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 46 and human policies 1.
[2025-11-13 10:59:28,492][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:59:37,491][__main__][INFO] - Number of regex retries in iteration 466: 0
[2025-11-13 10:59:37,491][__main__][INFO] - agents played in iteration 466 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 10:59:37,930][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:59:37,964][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:59:37,997][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:59:38,030][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:59:38,031][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:59:38,031][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:59:38,728][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:59:39,023][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:59:39,351][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:59:39,677][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:59:40,003][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:59:40,329][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:59:40,655][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:59:40,982][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:59:41,315][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:59:41,642][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:59:41,970][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:59:42,298][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:59:42,623][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:59:42,950][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:59:43,278][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:59:43,607][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:59:43,935][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:59:44,261][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:59:44,587][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:59:44,914][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:59:45,242][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:59:45,568][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:59:45,895][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:59:46,222][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:59:46,550][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:59:46,877][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:59:47,204][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:59:47,529][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:59:47,858][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:59:48,183][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:59:48,510][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:59:48,836][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:59:49,163][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:59:49,895][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:59:50,613][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:59:50,807][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:59:50,809][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:59:51,774][__main__][INFO] - Iteration 467 took 23s (38.65% Gen, 57.20% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 28m 7s. Estimated total time: 19h 24m 9s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 48s, 500 more iterations: 3h 14m 1s.
[2025-11-13 10:59:51,776][__main__][INFO] - Starting iteration 467.
[2025-11-13 10:59:51,779][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 46 and human policies 1.
[2025-11-13 10:59:51,780][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:00:00,899][__main__][INFO] - Number of regex retries in iteration 467: 0 [2025-11-13 11:00:00,900][__main__][INFO] - agents played in iteration 467 are Alice_buffer, Alice, Bob, Bob_buffer [2025-11-13 11:00:01,342][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:00:01,379][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:00:01,413][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:00:01,447][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:00:01,448][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:00:01,448][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:00:02,143][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:00:02,439][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:00:02,770][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:00:03,097][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:00:03,427][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:00:03,757][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:00:04,088][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:00:04,416][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:00:04,745][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:00:05,074][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:00:05,402][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:00:05,731][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:00:06,064][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:00:06,392][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:00:06,718][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:00:07,045][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:00:07,373][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:00:07,699][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:00:08,027][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:00:08,354][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:00:08,681][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:00:09,008][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:00:09,336][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:00:09,662][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:00:09,988][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:00:10,315][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:00:10,643][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:00:10,969][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:00:11,296][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:00:11,622][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:00:11,950][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:00:12,279][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:00:12,606][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:00:13,335][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 11:00:14,067][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:00:14,069][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:00:14,070][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:00:15,120][__main__][INFO] - Iteration 468 took 23s (39.07% Gen, 56.42% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 30m 39s. Estimated total time: 19h 27m 5s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 54s, 500 more iterations: 3h 14m 30s. [2025-11-13 11:00:15,122][__main__][INFO] - Starting iteration 468. [2025-11-13 11:00:15,125][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 46 and human policies 1. 
[2025-11-13 11:00:15,125][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:00:24,651][__main__][INFO] - Number of regex retries in iteration 468: 0 [2025-11-13 11:00:24,652][__main__][INFO] - agents played in iteration 468 are Alice_buffer, Alice, Bob, Bob_buffer [2025-11-13 11:00:25,089][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:00:25,123][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:00:25,157][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:00:25,192][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:00:25,192][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:00:25,192][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:00:25,882][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:00:26,180][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:00:26,507][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:00:26,832][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:00:27,159][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:00:27,486][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:00:27,812][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:00:28,143][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:00:28,469][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:00:28,797][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:00:29,124][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:00:29,451][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:00:29,777][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:00:30,105][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:00:30,434][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:00:30,761][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:00:31,090][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:00:31,417][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:00:31,744][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:00:32,072][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:00:32,399][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:00:32,726][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:00:33,053][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:00:33,380][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:00:33,708][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:00:34,035][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:00:34,362][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:00:34,689][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:00:35,016][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:00:35,343][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:00:35,671][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:00:35,998][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:00:36,326][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:00:37,061][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 11:00:37,789][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:00:37,791][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:00:37,792][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:00:38,776][__main__][INFO] - Iteration 469 took 23s (40.27% Gen, 55.56% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 45m 48s. Estimated total time: 19h 42m 38s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 25s, 500 more iterations: 3h 17m 6s. [2025-11-13 11:00:38,778][__main__][INFO] - Starting iteration 469. [2025-11-13 11:00:38,781][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 46 and human policies 1. 
[2025-11-13 11:00:38,782][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:00:47,525][__main__][INFO] - Number of regex retries in iteration 469: 0 [2025-11-13 11:00:47,526][__main__][INFO] - agents played in iteration 469 are Alice_buffer, Alice, Bob, Bob_buffer [2025-11-13 11:00:47,961][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:00:47,995][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:00:48,029][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:00:48,063][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:00:48,063][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:00:48,064][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:00:48,775][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:00:49,071][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:00:49,397][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:00:49,725][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:00:50,052][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:00:50,379][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:00:50,706][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:00:51,033][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:00:51,362][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:00:51,693][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:00:52,024][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:00:52,352][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:00:52,681][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:00:53,011][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:00:53,343][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:00:53,673][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:00:54,005][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:00:54,332][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:00:54,659][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:00:54,987][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:00:55,313][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:00:55,640][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:00:55,967][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:00:56,294][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:00:56,621][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:00:56,949][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:00:57,275][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:00:57,604][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:00:57,932][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:00:58,259][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:00:58,586][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:00:58,914][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:00:59,242][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:00:59,968][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 11:01:00,685][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:01:00,687][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:01:00,688][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:01:01,639][__main__][INFO] - Iteration 470 took 22s (38.25% Gen, 57.58% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 5m 43s. Estimated total time: 19h 2m 55s. Time estimates for 10 more iterations: 3m 48s, 100 more iterations: 38m 5s, 500 more iterations: 3h 10m 29s. [2025-11-13 11:01:01,641][__main__][INFO] - Starting iteration 470. [2025-11-13 11:01:01,644][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 46 and human policies 1. 
[2025-11-13 11:01:01,645][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:01:11,657][__main__][INFO] - Number of regex retries in iteration 470: 0 [2025-11-13 11:01:11,657][__main__][INFO] - agents played in iteration 470 are Alice_buffer, Alice, Bob, Bob_buffer [2025-11-13 11:01:12,104][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:01:12,138][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:01:12,170][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:01:12,204][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:01:12,204][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:01:12,205][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:01:12,934][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:01:13,230][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:01:13,559][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:01:13,888][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:01:14,217][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:01:14,546][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:01:14,875][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:01:15,204][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:01:15,536][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:01:15,863][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:01:16,195][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:01:16,523][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:01:16,855][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:01:17,182][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:01:17,509][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:01:17,841][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:01:18,168][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:01:18,496][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:01:18,823][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:01:19,150][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:01:19,477][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:01:19,804][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:01:20,132][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:01:20,460][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:01:20,787][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:01:21,113][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:01:21,442][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:01:21,768][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:01:22,096][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:01:22,423][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:01:22,750][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:01:23,077][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:01:23,404][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:01:24,139][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 11:01:24,874][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:01:24,875][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:01:24,877][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:01:26,694][__main__][INFO] - Iteration 471 took 25s (39.97% Gen, 52.77% Train). Generation: 10s, Training: 13s. Estimated remaining time: 17h 54m 54s. Estimated total time: 20h 52m 31s. Time estimates for 10 more iterations: 4m 10s, 100 more iterations: 41m 45s, 500 more iterations: 3h 28m 45s. [2025-11-13 11:01:26,696][__main__][INFO] - Starting iteration 471. [2025-11-13 11:01:26,699][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 47 and human policies 1. 
[2025-11-13 11:01:26,699][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:01:35,322][__main__][INFO] - Number of regex retries in iteration 471: 0 [2025-11-13 11:01:35,322][__main__][INFO] - agents played in iteration 471 are Alice_buffer, Alice, Bob, Bob_buffer [2025-11-13 11:01:35,749][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:01:35,783][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:01:35,817][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:01:35,851][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:01:35,852][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:01:35,852][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:01:36,578][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:01:36,875][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:01:37,201][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:01:37,526][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:01:37,853][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:01:38,180][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:01:38,508][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:01:38,836][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:01:39,163][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:01:39,491][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:01:39,818][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:01:40,149][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:01:40,479][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:01:40,806][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:01:41,133][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:01:41,461][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:01:41,788][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:01:42,116][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:01:42,444][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:01:42,770][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:01:43,097][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:01:43,423][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:01:43,750][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:01:44,077][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:01:44,406][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:01:44,731][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:01:45,058][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:01:45,386][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:01:45,713][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:01:46,041][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:01:46,369][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:01:46,696][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:01:47,022][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:01:47,751][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 11:01:48,470][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:01:48,471][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:01:48,473][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:01:49,419][__main__][INFO] - Iteration 472 took 22s (37.95% Gen, 57.88% Train). Generation: 8s, Training: 13s. Estimated remaining time: 15h 58m 4s. Estimated total time: 18h 56m 4s. Time estimates for 10 more iterations: 3m 47s, 100 more iterations: 37m 52s, 500 more iterations: 3h 9m 20s. [2025-11-13 11:01:49,421][__main__][INFO] - Starting iteration 472. [2025-11-13 11:01:49,425][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 47 and human policies 1. 
[2025-11-13 11:01:49,426][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:01:58,656][__main__][INFO] - Number of regex retries in iteration 472: 0 [2025-11-13 11:01:58,657][__main__][INFO] - agents played in iteration 472 are Alice_buffer, Alice, Bob, Bob_buffer [2025-11-13 11:01:59,093][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:01:59,128][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:01:59,162][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:01:59,196][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:01:59,197][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:01:59,197][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:01:59,909][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:02:00,205][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:02:00,530][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:02:00,855][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:02:01,182][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:02:01,509][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:02:01,836][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:02:02,163][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:02:02,490][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:02:02,818][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:02:03,146][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:02:03,477][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:02:03,809][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:02:04,136][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:02:04,464][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:02:04,791][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:02:05,118][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:02:05,444][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:02:05,772][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:02:06,098][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:02:06,425][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:02:06,752][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:02:07,078][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:02:07,405][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:02:07,732][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:02:08,059][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:02:08,386][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:02:08,713][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:02:09,041][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:02:09,368][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:02:09,698][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:02:10,027][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:02:10,357][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:02:11,087][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 11:02:11,804][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:02:11,805][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:02:11,807][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:02:12,854][__main__][INFO] - Iteration 473 took 23s (39.40% Gen, 56.13% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 33m 5s. Estimated total time: 19h 31m 28s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 2s, 500 more iterations: 3h 15m 14s.
[2025-11-13 11:02:12,856][__main__][INFO] - Starting iteration 473.
[2025-11-13 11:02:12,859][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 47 and human policies 1.
[2025-11-13 11:02:12,859][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:02:22,254][__main__][INFO] - Number of regex retries in iteration 473: 0
[2025-11-13 11:02:22,255][__main__][INFO] - agents played in iteration 473 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 11:02:22,691][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:02:22,725][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:02:22,759][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:02:22,793][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:02:22,794][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:02:22,794][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:02:23,486][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:02:23,782][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:02:24,109][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:02:24,435][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:02:24,764][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:02:25,091][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:02:25,419][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:02:25,748][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:02:26,076][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:02:26,403][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:02:26,729][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:02:27,061][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:02:27,390][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:02:27,716][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:02:28,043][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:02:28,370][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:02:28,696][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:02:29,023][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:02:29,350][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:02:29,676][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:02:30,003][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:02:30,329][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:02:30,656][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:02:30,983][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:02:31,312][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:02:31,639][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:02:31,968][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:02:32,294][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:02:32,621][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:02:32,947][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:02:33,276][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:02:33,606][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:02:33,933][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:02:34,661][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 11:02:35,376][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:02:35,377][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:02:35,379][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:02:36,351][__main__][INFO] - Iteration 474 took 23s (39.99% Gen, 55.86% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 35m 51s. Estimated total time: 19h 34m 38s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 9s, 500 more iterations: 3h 15m 46s.
[2025-11-13 11:02:36,353][__main__][INFO] - Starting iteration 474.
[2025-11-13 11:02:36,355][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 47 and human policies 1.
[2025-11-13 11:02:36,356][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:02:45,892][__main__][INFO] - Number of regex retries in iteration 474: 0
[2025-11-13 11:02:45,892][__main__][INFO] - agents played in iteration 474 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 11:02:46,328][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:02:46,365][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:02:46,398][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:02:46,433][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:02:46,433][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:02:46,434][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:02:47,156][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:02:47,453][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:02:47,781][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:02:48,110][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:02:48,438][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:02:48,764][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:02:49,092][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:02:49,419][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:02:49,748][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:02:50,079][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:02:50,408][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:02:50,735][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:02:51,062][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:02:51,390][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:02:51,716][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:02:52,042][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:02:52,369][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:02:52,695][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:02:53,023][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:02:53,350][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:02:53,677][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:02:54,004][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:02:54,331][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:02:54,658][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:02:54,985][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:02:55,312][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:02:55,639][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:02:55,965][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:02:56,291][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:02:56,619][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:02:56,946][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:02:57,272][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:02:57,600][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:02:58,334][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 11:02:59,071][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:02:59,073][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:02:59,074][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:03:00,027][__main__][INFO] - Iteration 475 took 23s (40.28% Gen, 55.69% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 44m 27s. Estimated total time: 19h 43m 38s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 27s, 500 more iterations: 3h 17m 16s.
[2025-11-13 11:03:00,029][__main__][INFO] - Starting iteration 475.
[2025-11-13 11:03:00,032][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 47 and human policies 1.
[2025-11-13 11:03:00,032][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:03:09,160][__main__][INFO] - Number of regex retries in iteration 475: 0
[2025-11-13 11:03:09,161][__main__][INFO] - agents played in iteration 475 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 11:03:09,604][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:03:09,638][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:03:09,672][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:03:09,706][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:03:09,707][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:03:09,707][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:03:10,441][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:03:10,738][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:03:11,066][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:03:11,393][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:03:11,720][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:03:12,051][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:03:12,377][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:03:12,704][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:03:13,031][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:03:13,359][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:03:13,685][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:03:14,016][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:03:14,346][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:03:14,674][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:03:15,001][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:03:15,329][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:03:15,655][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:03:15,983][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:03:16,309][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:03:16,636][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:03:16,963][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:03:17,289][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:03:17,616][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:03:17,944][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:03:18,271][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:03:18,597][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:03:18,924][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:03:19,252][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:03:19,579][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:03:19,907][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:03:20,233][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:03:20,561][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:03:20,890][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:03:21,615][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 11:03:22,347][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:03:22,348][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:03:22,350][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:03:23,337][__main__][INFO] - Iteration 476 took 23s (39.17% Gen, 56.59% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 25m 44s. Estimated total time: 19h 25m 19s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 50s, 500 more iterations: 3h 14m 13s.
[2025-11-13 11:03:23,340][__main__][INFO] - Starting iteration 476.
[2025-11-13 11:03:23,343][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 47 and human policies 1.
[2025-11-13 11:03:23,343][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:03:32,710][__main__][INFO] - Number of regex retries in iteration 476: 0
[2025-11-13 11:03:32,711][__main__][INFO] - agents played in iteration 476 are Alice_buffer, Alice, Bob, Bob_buffer
[2025-11-13 11:03:33,152][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:03:33,188][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:03:33,222][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:03:33,256][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:03:33,257][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:03:33,257][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:03:33,971][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:03:34,268][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:03:34,595][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:03:34,921][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:03:35,250][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:03:35,579][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:03:35,908][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:03:36,238][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:03:36,568][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:03:36,896][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:03:37,224][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:03:37,552][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:03:37,881][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:03:38,209][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:03:38,537][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:03:38,863][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:03:39,190][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:03:39,516][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:03:39,844][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:03:40,171][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:03:40,497][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:03:40,824][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:03:41,152][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:03:41,480][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:03:41,807][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:03:42,135][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:03:42,462][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:03:42,789][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:03:43,116][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:03:43,443][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:03:43,770][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:03:44,097][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:03:44,425][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:03:45,156][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 11:03:45,907][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:03:45,908][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:03:45,910][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:06:51,640][mllm.models.large_language_model_local][INFO] - Loaded 47 past agent adapters from checkpoints directory.
[2025-11-13 11:07:10,396][mllm.models.large_language_model_local][INFO] - Initializing adapter 'agent_adapter': using existing weights from output folder '/scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/Qwen/Qwen2.5-7B-Instruct/adapters/agent_adapter'.
[2025-11-13 11:07:11,621][mllm.models.adapter_training_wrapper][INFO] - Adapter 'agent_adapter': loaded initial weights from '/scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/Qwen/Qwen2.5-7B-Instruct/adapters/agent_adapter'.
[2025-11-13 11:07:11,629][mllm.models.large_language_model_local][INFO] - Initializing adapter 'critic_adapter': using existing weights from output folder '/scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/Qwen/Qwen2.5-7B-Instruct/adapters/critic_adapter'.
[2025-11-13 11:07:12,769][mllm.models.adapter_training_wrapper][WARNING] - Adapter 'critic_adapter': failed to load from '/scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/Qwen/Qwen2.5-7B-Instruct/adapters/critic_adapter': Error while deserializing header: MetadataIncompleteBuffer
[2025-11-13 11:07:12,769][mllm.models.adapter_training_wrapper][INFO] - Adapter 'critic_adapter': initialized with fresh weights (no initial weights found).
[2025-11-13 11:09:23,622][mllm.training.trainer_common][INFO] - Loading trainer state from /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:09:23,625][mllm.training.trainer_common][INFO] - Loading policy optimizer state from /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:09:24,454][mllm.training.trainer_common][INFO] - Loading critic optimizer state from /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:09:24,457][__main__][INFO] - Starting iteration 476.
[2025-11-13 11:09:24,460][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 47 and human policies 1.
[2025-11-13 11:09:24,461][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:09:53,395][__main__][INFO] - Number of regex retries in iteration 476: 0
[2025-11-13 11:09:53,395][__main__][INFO] - agents played in iteration 476 are Bob_buffer, Alice_buffer, Alice, Bob
[2025-11-13 11:09:53,843][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 39.04%, Block Peak % of device VRAM: 19.44%, ΔTime: 00:00:00
[2025-11-13 11:09:53,883][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 39.04%, Block Peak % of device VRAM: 19.44%, ΔTime: 00:00:00
[2025-11-13 11:09:53,923][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 39.04%, Block Peak % of device VRAM: 19.44%, ΔTime: 00:00:00
[2025-11-13 11:09:53,962][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 39.04%, Block Peak % of device VRAM: 19.44%, ΔTime: 00:00:00
[2025-11-13 11:09:53,963][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:09:53,963][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:09:54,567][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:09:55,180][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:09:55,514][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:09:55,848][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:09:56,181][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:09:56,513][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:09:56,846][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:09:57,177][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:09:57,517][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:09:57,848][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:09:58,177][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:09:58,514][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:09:58,844][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:09:59,177][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:09:59,504][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:09:59,836][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:10:00,161][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:10:00,487][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:10:00,812][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:10:01,142][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:10:01,474][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:10:01,802][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:10:02,131][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:10:02,458][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:10:02,785][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:10:03,119][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:10:03,450][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:10:03,787][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:10:04,112][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:10:04,442][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:10:04,773][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:10:05,100][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:10:05,427][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:10:06,086][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.78%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 11:10:06,979][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:10:06,982][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:10:06,984][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:10:08,216][__main__][INFO] - Iteration 477 took 43s (66.13% Gen, 31.06% Train). Generation: 28s, Training: 13s. Estimated remaining time: 36h 24m 31s. Estimated total time: 36h 27m 49s. Time estimates for 10 more iterations: 7m 17s, 100 more iterations: 1h 12m 55s, 500 more iterations: 6h 4m 38s. [2025-11-13 11:10:08,218][__main__][INFO] - Starting iteration 477. [2025-11-13 11:10:08,222][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 47 and human policies 1. 
[2025-11-13 11:10:08,223][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:10:23,116][__main__][INFO] - Number of regex retries in iteration 477: 0 [2025-11-13 11:10:23,117][__main__][INFO] - agents played in iteration 477 are Bob_buffer, Alice_buffer, Alice, Bob [2025-11-13 11:10:23,533][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:10:23,575][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:10:23,614][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:10:23,653][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:10:23,654][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:10:23,654][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:10:24,330][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:10:24,628][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:10:24,956][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:10:25,289][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:10:25,619][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:10:25,947][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:10:26,274][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:10:26,601][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:10:26,929][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:10:27,260][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:10:27,594][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:10:27,929][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:10:28,257][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:10:28,583][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:10:28,913][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:10:29,244][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:10:29,572][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:10:29,898][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:10:30,223][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:10:30,549][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:10:30,874][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 11:10:31,198][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:10:31,527][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:10:31,854][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:10:32,180][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:10:32,506][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:10:32,832][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:10:33,157][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:10:33,483][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:10:33,810][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:10:34,140][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:10:34,463][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:10:34,791][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 11:10:35,425][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:10:36,138][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:10:36,140][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:10:36,142][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:10:37,099][__main__][INFO] - Iteration 478 took 28s (51.57% Gen, 45.10% Train). Generation: 14s, Training: 13s. Estimated remaining time: 24h 0m 7s. Estimated total time: 24h 3m 54s. Time estimates for 10 more iterations: 4m 48s, 100 more iterations: 48m 7s, 500 more iterations: 4h 0m 39s. [2025-11-13 11:10:37,101][__main__][INFO] - Starting iteration 478. [2025-11-13 11:10:37,104][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 47 and human policies 1. 
[2025-11-13 11:10:37,105][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:10:50,002][__main__][INFO] - Number of regex retries in iteration 478: 0 [2025-11-13 11:10:50,003][__main__][INFO] - agents played in iteration 478 are Bob_buffer, Alice_buffer, Alice, Bob [2025-11-13 11:10:50,466][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:10:50,506][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:10:50,545][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:10:50,584][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:10:50,584][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:10:50,585][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:10:51,262][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:10:51,563][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:10:51,890][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:10:52,217][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:10:52,544][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:10:52,872][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:10:53,200][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:10:53,526][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:10:53,854][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:10:54,181][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:10:54,510][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:10:54,837][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:10:55,164][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:10:55,490][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:10:55,816][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:10:56,142][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:10:56,470][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:10:56,797][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:10:57,122][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:10:57,450][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:10:57,778][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 11:10:58,105][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:10:58,433][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:10:58,758][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:10:59,083][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:10:59,413][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:10:59,739][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:11:00,067][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:11:00,392][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:11:00,720][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:11:01,051][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:11:01,381][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:11:01,709][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 11:11:02,356][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:11:03,070][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:11:03,071][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:11:03,073][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:11:03,956][__main__][INFO] - Iteration 479 took 26s (48.03% Gen, 48.67% Train). Generation: 12s, Training: 13s. Estimated remaining time: 22h 18m 25s. Estimated total time: 22h 22m 39s. Time estimates for 10 more iterations: 4m 28s, 100 more iterations: 44m 45s, 500 more iterations: 3h 43m 46s. [2025-11-13 11:11:03,958][__main__][INFO] - Starting iteration 479. [2025-11-13 11:11:03,962][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 47 and human policies 1. 
[2025-11-13 11:11:03,962][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:11:17,302][__main__][INFO] - Number of regex retries in iteration 479: 0 [2025-11-13 11:11:17,302][__main__][INFO] - agents played in iteration 479 are Bob_buffer, Alice_buffer, Alice, Bob [2025-11-13 11:11:17,724][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:11:17,764][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:11:17,805][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:11:17,857][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:11:17,858][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:11:17,858][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:11:18,525][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:11:18,823][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:11:19,152][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:11:19,486][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:11:19,814][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:11:20,145][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:11:20,473][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:11:20,800][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:11:21,133][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:11:21,467][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:11:21,793][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:11:22,119][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:11:22,450][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:11:22,775][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:11:23,101][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:11:23,427][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:11:23,753][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:11:24,080][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:11:24,407][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:11:24,734][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:11:25,063][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 11:11:25,388][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:11:25,714][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:11:26,039][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:11:26,367][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:11:26,694][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:11:27,022][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:11:27,350][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:11:27,676][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:11:28,001][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:11:28,328][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:11:28,656][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:11:28,985][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 11:11:29,619][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:11:30,356][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:11:30,358][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:11:30,360][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:11:31,231][__main__][INFO] - Iteration 480 took 27s (48.92% Gen, 47.88% Train). Generation: 13s, Training: 13s. Estimated remaining time: 22h 38m 50s. Estimated total time: 22h 43m 32s. Time estimates for 10 more iterations: 4m 32s, 100 more iterations: 45m 27s, 500 more iterations: 3h 47m 15s. [2025-11-13 11:11:31,234][__main__][INFO] - Starting iteration 480. [2025-11-13 11:11:31,237][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 47 and human policies 1. 
[2025-11-13 11:11:31,238][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:11:41,150][__main__][INFO] - Number of regex retries in iteration 480: 0 [2025-11-13 11:11:41,151][__main__][INFO] - agents played in iteration 480 are Bob_buffer, Alice_buffer, Alice, Bob [2025-11-13 11:11:41,578][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:11:41,618][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:11:41,658][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:11:41,697][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:11:41,698][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:11:41,698][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:11:42,370][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:11:42,669][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:11:43,000][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:11:43,337][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:11:43,664][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:11:43,992][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:11:44,320][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:11:44,656][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:11:44,982][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:11:45,310][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:11:45,637][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:11:45,973][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:11:46,300][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:11:46,633][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:11:46,966][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:11:47,293][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:11:47,617][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:11:47,943][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:11:48,272][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:11:48,593][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:11:48,919][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 11:11:49,247][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:11:49,572][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:11:49,901][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:11:50,231][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:11:50,560][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:11:50,890][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:11:51,223][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:11:51,553][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:11:51,888][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:11:52,218][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:11:52,544][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:11:52,869][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 11:11:53,500][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:11:54,213][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:11:54,214][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:11:54,216][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:11:56,034][__main__][INFO] - Iteration 481 took 24s (39.98% Gen, 52.69% Train). Generation: 9s, Training: 13s. Estimated remaining time: 20h 34m 48s. Estimated total time: 20h 39m 54s. Time estimates for 10 more iterations: 4m 7s, 100 more iterations: 41m 19s, 500 more iterations: 3h 26m 39s. [2025-11-13 11:11:56,037][__main__][INFO] - Starting iteration 481. [2025-11-13 11:11:56,040][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 48 and human policies 1. 
[2025-11-13 11:11:56,040][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:12:06,415][__main__][INFO] - Number of regex retries in iteration 481: 0 [2025-11-13 11:12:06,416][__main__][INFO] - agents played in iteration 481 are Bob_buffer, Alice_buffer, Alice, Bob [2025-11-13 11:12:06,838][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:12:07,194][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:12:07,234][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:12:07,273][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:12:07,274][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:12:07,274][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:12:07,948][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:12:08,246][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:12:08,573][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:12:08,899][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:12:09,229][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:12:09,559][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:12:09,888][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:12:10,215][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:12:10,541][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:12:10,868][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:12:11,194][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:12:11,521][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:12:11,846][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:12:12,172][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:12:12,498][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:12:12,827][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:12:13,152][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:12:13,480][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:12:13,807][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:12:14,136][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:12:14,465][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 11:12:14,792][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:12:15,121][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:12:15,449][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:12:15,775][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:12:16,103][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:12:16,431][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:12:16,758][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:12:17,084][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:12:17,412][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:12:17,736][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:12:18,063][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:12:18,390][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 11:12:19,042][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:12:19,756][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:12:19,758][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:12:19,760][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:12:20,666][__main__][INFO] - Iteration 482 took 24s (42.13% Gen, 54.18% Train). Generation: 10s, Training: 13s. Estimated remaining time: 20h 25m 51s. Estimated total time: 20h 31m 22s. Time estimates for 10 more iterations: 4m 6s, 100 more iterations: 41m 2s, 500 more iterations: 3h 25m 13s. [2025-11-13 11:12:20,669][__main__][INFO] - Starting iteration 482. [2025-11-13 11:12:20,672][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 48 and human policies 1. 
[2025-11-13 11:12:20,673][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:12:31,050][__main__][INFO] - Number of regex retries in iteration 482: 0
[2025-11-13 11:12:31,050][__main__][INFO] - agents played in iteration 482 are Bob_buffer, Alice_buffer, Alice, Bob
[2025-11-13 11:12:31,478][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:12:31,518][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:12:31,558][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:12:31,597][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:12:31,598][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:12:31,598][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:12:32,275][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:12:32,573][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:12:32,904][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:12:33,240][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:12:33,566][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:12:33,896][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:12:34,226][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:12:34,557][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:12:34,886][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:12:35,215][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:12:35,541][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:12:35,871][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:12:36,201][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:12:36,530][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:12:36,856][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:12:37,183][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:12:37,510][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:12:37,840][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:12:38,171][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:12:38,500][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:12:38,827][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:12:39,157][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:12:39,485][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:12:39,810][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:12:40,137][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:12:40,470][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:12:40,798][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:12:41,134][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:12:41,459][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:12:41,787][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:12:42,119][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:12:42,456][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:12:42,784][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:12:43,434][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:12:44,148][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:12:44,150][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:12:44,152][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:12:44,986][__main__][INFO] - Iteration 483 took 24s (42.68% Gen, 53.88% Train). Generation: 10s, Training: 13s. Estimated remaining time: 20h 9m 49s. Estimated total time: 20h 15m 44s. Time estimates for 10 more iterations: 4m 3s, 100 more iterations: 40m 31s, 500 more iterations: 3h 22m 37s.
[2025-11-13 11:12:44,988][__main__][INFO] - Starting iteration 483.
[2025-11-13 11:12:44,992][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 48 and human policies 1.
[2025-11-13 11:12:44,992][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:12:54,401][__main__][INFO] - Number of regex retries in iteration 483: 0
[2025-11-13 11:12:54,402][__main__][INFO] - agents played in iteration 483 are Bob_buffer, Alice_buffer, Alice, Bob
[2025-11-13 11:12:54,829][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:12:54,869][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:12:54,908][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:12:54,948][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:12:54,948][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:12:54,949][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:12:55,643][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:12:55,943][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:12:56,271][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:12:56,598][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:12:56,927][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:12:57,256][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:12:57,590][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:12:57,917][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:12:58,247][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:12:58,576][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:12:58,905][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:12:59,243][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:12:59,574][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:12:59,903][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:13:00,231][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:13:00,558][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:13:00,884][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:13:01,209][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:13:01,537][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:13:01,862][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:13:02,188][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:13:02,519][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:13:02,851][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:13:03,179][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:13:03,508][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:13:03,833][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:13:04,163][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:13:04,491][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:13:04,817][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:13:05,146][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:13:05,476][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:13:05,803][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:13:06,130][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:13:06,789][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:13:07,556][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:13:07,558][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:13:07,560][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:13:08,404][__main__][INFO] - Iteration 484 took 23s (40.19% Gen, 56.20% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 24m 21s. Estimated total time: 19h 30m 40s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 1s, 500 more iterations: 3h 15m 6s.
[2025-11-13 11:13:08,406][__main__][INFO] - Starting iteration 484.
[2025-11-13 11:13:08,409][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 48 and human policies 1.
[2025-11-13 11:13:08,409][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:13:17,874][__main__][INFO] - Number of regex retries in iteration 484: 0
[2025-11-13 11:13:17,875][__main__][INFO] - agents played in iteration 484 are Bob_buffer, Alice_buffer, Alice, Bob
[2025-11-13 11:13:18,303][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:13:18,353][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:13:18,401][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:13:18,440][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:13:18,441][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:13:18,442][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:13:19,137][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:13:19,436][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:13:19,765][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:13:20,094][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:13:20,421][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:13:20,747][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:13:21,072][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:13:21,398][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:13:21,728][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:13:22,054][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:13:22,383][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:13:22,710][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:13:23,040][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:13:23,367][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:13:23,695][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:13:24,028][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:13:24,360][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:13:24,688][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:13:25,013][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:13:25,353][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:13:25,679][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:13:26,005][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:13:26,332][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:13:26,659][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:13:26,985][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:13:27,310][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:13:27,638][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:13:27,970][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:13:28,296][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:13:28,627][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:13:28,957][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:13:29,284][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:13:29,614][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:13:30,273][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:13:31,000][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:13:31,003][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:13:31,004][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:13:31,826][__main__][INFO] - Iteration 485 took 23s (40.42% Gen, 56.07% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 24m 11s. Estimated total time: 19h 30m 53s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 1s, 500 more iterations: 3h 15m 8s.
[2025-11-13 11:13:31,828][__main__][INFO] - Starting iteration 485.
[2025-11-13 11:13:31,831][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 48 and human policies 1.
[2025-11-13 11:13:31,832][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:13:41,490][__main__][INFO] - Number of regex retries in iteration 485: 0
[2025-11-13 11:13:41,491][__main__][INFO] - agents played in iteration 485 are Bob_buffer, Alice_buffer, Alice, Bob
[2025-11-13 11:13:41,918][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:13:41,958][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:13:42,000][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:13:42,041][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:13:42,041][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:13:42,041][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:13:42,719][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:13:43,018][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:13:43,349][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:13:43,676][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:13:44,001][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:13:44,329][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:13:44,659][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:13:44,994][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:13:45,324][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:13:45,651][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:13:45,982][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:13:46,317][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:13:46,658][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:13:46,985][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:13:47,315][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:13:47,641][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:13:47,970][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:13:48,295][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:13:48,621][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:13:48,947][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:13:49,280][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:13:49,610][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:13:49,941][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:13:50,267][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:13:50,596][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:13:50,927][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:13:51,255][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:13:51,581][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:13:51,911][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:13:52,241][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:13:52,569][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:13:52,896][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:13:53,227][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:13:53,910][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:13:54,638][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:13:54,640][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:13:54,641][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:13:55,525][__main__][INFO] - Iteration 486 took 23s (40.76% Gen, 55.50% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 37m 40s. Estimated total time: 19h 44m 45s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 29s, 500 more iterations: 3h 17m 27s.
[2025-11-13 11:13:55,527][__main__][INFO] - Starting iteration 486.
[2025-11-13 11:13:55,530][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 48 and human policies 1.
[2025-11-13 11:13:55,531][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:14:04,871][__main__][INFO] - Number of regex retries in iteration 486: 0
[2025-11-13 11:14:04,872][__main__][INFO] - agents played in iteration 486 are Bob_buffer, Alice_buffer, Alice, Bob
[2025-11-13 11:14:05,301][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:14:05,341][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:14:05,380][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:14:05,419][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:14:05,420][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:14:05,420][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:14:06,110][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:14:06,407][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:14:06,736][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:14:07,067][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:14:07,398][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:14:07,725][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:14:08,050][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:14:08,379][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:14:08,709][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:14:09,036][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:14:09,370][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:14:09,698][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:14:10,025][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:14:10,352][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:14:10,680][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:14:11,006][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:14:11,331][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:14:11,656][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:14:11,985][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:14:12,310][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:14:12,634][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:14:12,963][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:14:13,289][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:14:13,618][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:14:13,943][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:14:14,268][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:14:14,596][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:14:14,923][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:14:15,249][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:14:15,579][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:14:15,908][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:14:16,236][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:14:16,568][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:14:17,242][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:14:17,947][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:14:17,949][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:14:17,950][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:14:18,792][__main__][INFO] - Iteration 487 took 23s (40.15% Gen, 56.22% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 15m 39s. Estimated total time: 19h 23m 8s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 46s, 500 more iterations: 3h 13m 51s.
[2025-11-13 11:14:18,794][__main__][INFO] - Starting iteration 487.
[2025-11-13 11:14:18,797][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 48 and human policies 1.
[2025-11-13 11:14:18,798][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:14:28,863][__main__][INFO] - Number of regex retries in iteration 487: 0
[2025-11-13 11:14:28,864][__main__][INFO] - agents played in iteration 487 are Bob_buffer, Alice_buffer, Alice, Bob
[2025-11-13 11:14:29,289][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:14:29,329][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:14:29,368][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:14:29,407][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:14:29,408][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:14:29,408][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:14:30,091][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:14:30,392][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:14:30,719][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:14:31,047][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:14:31,377][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:14:31,703][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:14:32,029][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:14:32,356][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:14:32,685][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:14:33,014][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:14:33,342][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:14:33,669][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:14:33,997][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:14:34,325][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:14:34,657][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:14:34,983][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:14:35,315][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:14:35,644][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:14:35,971][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:14:36,296][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:14:36,622][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:14:36,949][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:14:37,278][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:14:37,606][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:14:37,931][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:14:38,257][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:14:38,585][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:14:38,915][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:14:39,243][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:14:39,571][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:14:39,899][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:14:40,226][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:14:40,556][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:14:41,238][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:14:41,940][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:14:41,941][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:14:41,943][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:14:42,740][__main__][INFO] - Iteration 488 took 23s (42.04% Gen, 54.62% Train). Generation: 10s, Training: 13s. Estimated remaining time: 19h 49m 20s. Estimated total time: 19h 57m 13s. Time estimates for 10 more iterations: 3m 59s, 100 more iterations: 39m 54s, 500 more iterations: 3h 19m 32s.
[2025-11-13 11:14:42,742][__main__][INFO] - Starting iteration 488.
[2025-11-13 11:14:42,745][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 48 and human policies 1.
[2025-11-13 11:14:42,746][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:14:51,221][__main__][INFO] - Number of regex retries in iteration 488: 0
[2025-11-13 11:14:51,222][__main__][INFO] - agents played in iteration 488 are Bob_buffer, Alice_buffer, Alice, Bob
[2025-11-13 11:14:51,652][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:14:51,692][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:14:51,732][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:14:51,772][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:14:51,772][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:14:51,773][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:14:52,465][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:14:52,763][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:14:53,091][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:14:53,417][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:14:53,744][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:14:54,070][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:14:54,398][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:14:54,727][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:14:55,055][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:14:55,381][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:14:55,710][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:14:56,037][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:14:56,371][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:14:56,698][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:14:57,024][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:14:57,352][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:14:57,681][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:14:58,007][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:14:58,338][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:14:58,667][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:14:58,995][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:14:59,324][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:14:59,653][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:14:59,987][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:15:00,315][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:15:00,645][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:15:00,978][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:15:01,305][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:15:01,639][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:15:01,971][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:15:02,298][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:15:02,625][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:15:02,958][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:15:03,613][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:15:04,358][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:15:04,360][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:15:04,361][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:15:05,194][__main__][INFO] - Iteration 489 took 22s (37.75% Gen, 58.53% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 34m 13s. Estimated total time: 18h 42m 29s. Time estimates for 10 more iterations: 3m 44s, 100 more iterations: 37m 24s, 500 more iterations: 3h 7m 4s.
[2025-11-13 11:15:05,196][__main__][INFO] - Starting iteration 489.
[2025-11-13 11:15:05,199][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 48 and human policies 1.
[2025-11-13 11:15:05,200][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:15:15,286][__main__][INFO] - Number of regex retries in iteration 489: 0
[2025-11-13 11:15:15,286][__main__][INFO] - agents played in iteration 489 are Bob_buffer, Alice_buffer, Alice, Bob
[2025-11-13 11:15:15,731][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:15:15,770][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:15:15,810][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:15:15,849][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:15:15,850][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:15:15,850][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:15:16,539][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:15:16,838][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:15:17,166][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:15:17,495][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:15:17,827][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:15:18,161][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:15:18,491][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:15:18,820][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:15:19,150][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:15:19,477][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:15:19,811][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:15:20,142][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:15:20,469][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:15:20,796][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:15:21,125][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:15:21,453][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:15:21,786][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:15:22,121][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:15:22,452][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:15:22,780][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:15:23,107][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:15:23,434][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:15:23,759][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:15:24,089][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:15:24,416][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:15:24,748][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:15:25,077][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:15:25,412][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:15:25,740][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:15:26,071][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:15:26,406][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:15:26,738][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:15:27,073][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:15:27,751][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:15:28,441][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:15:28,545][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:15:28,548][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:15:29,420][__main__][INFO] - Iteration 490 took 24s (41.64% Gen, 54.75% Train). Generation: 10s, Training: 13s. Estimated remaining time: 20h 2m 28s. Estimated total time: 20h 11m 7s. Time estimates for 10 more iterations: 4m 2s, 100 more iterations: 40m 22s, 500 more iterations: 3h 21m 51s.
[2025-11-13 11:15:29,422][__main__][INFO] - Starting iteration 490.
[2025-11-13 11:15:29,426][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 48 and human policies 1.
[2025-11-13 11:15:29,426][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:15:39,015][__main__][INFO] - Number of regex retries in iteration 490: 0
[2025-11-13 11:15:39,016][__main__][INFO] - agents played in iteration 490 are Bob_buffer, Alice_buffer, Alice, Bob
[2025-11-13 11:15:39,461][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:15:39,501][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:15:39,541][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:15:39,581][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:15:39,581][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:15:39,582][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:15:40,279][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:15:40,579][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:15:40,908][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:15:41,235][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:15:41,573][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:15:41,900][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:15:42,227][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:15:42,556][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:15:42,897][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:15:43,225][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:15:43,553][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:15:43,886][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:15:44,216][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:15:44,543][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:15:44,872][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:15:45,199][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:15:45,527][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:15:45,856][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:15:46,187][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:15:46,516][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:15:46,842][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:15:47,168][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:15:47,493][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:15:47,822][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:15:48,147][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:15:48,474][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:15:48,799][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:15:49,127][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:15:49,456][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:15:49,783][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:15:50,112][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:15:50,440][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:15:50,767][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:15:51,455][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:15:52,174][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:15:52,177][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:15:52,179][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:15:53,898][__main__][INFO] - Iteration 491 took 24s (39.18% Gen, 53.78% Train). Generation: 9s, Training: 13s. Estimated remaining time: 20h 14m 37s. Estimated total time: 20h 23m 41s. Time estimates for 10 more iterations: 4m 4s, 100 more iterations: 40m 47s, 500 more iterations: 3h 23m 56s.
[2025-11-13 11:15:53,903][__main__][INFO] - Starting iteration 491.
[2025-11-13 11:15:53,906][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 49 and human policies 1.
[2025-11-13 11:15:53,907][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:16:03,612][__main__][INFO] - Number of regex retries in iteration 491: 0
[2025-11-13 11:16:03,613][__main__][INFO] - agents played in iteration 491 are Bob_buffer, Alice_buffer, Alice, Bob
[2025-11-13 11:16:04,056][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:16:04,096][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:16:04,135][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:16:04,175][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:16:04,175][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:16:04,176][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:16:04,860][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:16:05,159][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:16:05,500][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:16:05,829][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:16:06,156][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:16:06,483][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:16:06,815][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:16:07,150][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:16:07,478][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:16:07,804][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:16:08,133][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:16:08,463][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:16:08,789][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:16:09,118][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:16:09,442][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:16:09,769][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:16:10,096][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:16:10,424][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:16:10,751][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:16:11,078][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:16:11,405][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:16:11,733][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:16:12,063][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:16:12,394][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:16:12,723][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:16:13,050][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:16:13,378][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:16:13,712][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:16:14,042][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:16:14,369][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:16:14,702][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:16:15,030][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:16:15,357][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:16:16,037][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:16:16,725][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:16:16,727][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:16:16,729][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:16:17,670][__main__][INFO] - Iteration 492 took 23s (40.84% Gen, 55.19% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 38m 47s. Estimated total time: 19h 48m 15s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 36s, 500 more iterations: 3h 18m 2s.
[2025-11-13 11:16:17,673][__main__][INFO] - Starting iteration 492.
[2025-11-13 11:16:17,676][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 49 and human policies 1.
[2025-11-13 11:16:17,677][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:16:27,246][__main__][INFO] - Number of regex retries in iteration 492: 0
[2025-11-13 11:16:27,247][__main__][INFO] - agents played in iteration 492 are Bob_buffer, Alice_buffer, Alice, Bob
[2025-11-13 11:16:27,668][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:16:27,724][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:16:27,765][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:16:27,805][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:16:27,805][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:16:27,805][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:16:28,505][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:16:28,804][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:16:29,132][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:16:29,458][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:16:29,787][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:16:30,119][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:16:30,449][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:16:30,779][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:16:31,111][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:16:31,440][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:16:31,767][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:16:32,101][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:16:32,430][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:16:32,756][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:16:33,086][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:16:33,412][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:16:33,738][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:16:34,065][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:16:34,396][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:16:34,723][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:16:35,050][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:16:35,377][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:16:35,706][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:16:36,040][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:16:36,368][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:16:36,697][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:16:37,024][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:16:37,353][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:16:37,682][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:16:38,012][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:16:38,339][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:16:38,666][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:16:38,992][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:16:39,719][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:16:40,407][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:16:40,410][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:16:40,412][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:16:41,366][__main__][INFO] - Iteration 493 took 23s (40.39% Gen, 55.57% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 34m 42s. Estimated total time: 19h 44m 33s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 29s, 500 more iterations: 3h 17m 25s.
[2025-11-13 11:16:41,369][__main__][INFO] - Starting iteration 493.
[2025-11-13 11:16:41,372][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 49 and human policies 1.
[2025-11-13 11:16:41,372][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:16:51,255][__main__][INFO] - Number of regex retries in iteration 493: 0 [2025-11-13 11:16:51,256][__main__][INFO] - agents played in iteration 493 are Bob_buffer, Alice_buffer, Alice, Bob [2025-11-13 11:16:51,694][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:16:51,734][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:16:51,774][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:16:51,815][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:16:51,816][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:16:51,816][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:16:52,511][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:16:52,817][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:16:53,141][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:16:53,471][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:16:53,806][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:16:54,136][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:16:54,465][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:16:54,791][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:16:55,122][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:16:55,451][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:16:55,777][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:16:56,106][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:16:56,440][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:16:56,769][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:16:57,096][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:16:57,424][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:16:57,750][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:16:58,084][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:16:58,414][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:16:58,742][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:16:59,070][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 11:16:59,397][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:16:59,724][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:17:00,051][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:17:00,378][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:17:00,706][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:17:01,033][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:17:01,361][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:17:01,688][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:17:02,022][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:17:02,345][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:17:02,674][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:17:03,002][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 11:17:03,668][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:17:04,376][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:17:04,377][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:17:04,380][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:17:05,273][__main__][INFO] - Iteration 494 took 23s (41.35% Gen, 54.91% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 44m 49s. Estimated total time: 19h 55m 5s. Time estimates for 10 more iterations: 3m 59s, 100 more iterations: 39m 50s, 500 more iterations: 3h 19m 10s. [2025-11-13 11:17:05,275][__main__][INFO] - Starting iteration 494. [2025-11-13 11:17:05,278][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 49 and human policies 1. 
[2025-11-13 11:17:05,279][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:17:14,365][__main__][INFO] - Number of regex retries in iteration 494: 0 [2025-11-13 11:17:14,366][__main__][INFO] - agents played in iteration 494 are Bob_buffer, Alice_buffer, Alice, Bob [2025-11-13 11:17:14,791][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:17:14,831][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:17:14,869][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:17:14,909][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:17:14,910][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:17:14,910][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:17:15,605][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:17:15,906][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:17:16,237][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:17:16,565][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:17:16,893][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:17:17,220][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:17:17,547][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:17:17,877][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:17:18,208][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:17:18,539][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:17:18,869][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:17:19,199][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:17:19,528][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:17:19,855][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:17:20,181][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:17:20,515][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:17:20,843][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:17:21,171][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:17:21,498][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:17:21,825][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:17:22,158][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 11:17:22,487][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:17:22,815][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:17:23,143][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:17:23,471][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:17:23,797][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:17:24,125][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:17:24,452][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:17:24,779][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:17:25,116][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:17:25,444][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:17:25,771][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:17:26,100][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 11:17:26,765][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:17:27,496][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:17:27,497][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:17:27,499][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:17:28,407][__main__][INFO] - Iteration 495 took 23s (39.29% Gen, 56.78% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 5m 50s. Estimated total time: 19h 16m 29s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 32s, 500 more iterations: 3h 12m 44s. [2025-11-13 11:17:28,409][__main__][INFO] - Starting iteration 495. [2025-11-13 11:17:28,413][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 49 and human policies 1. 
[2025-11-13 11:17:28,414][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:17:37,885][__main__][INFO] - Number of regex retries in iteration 495: 0 [2025-11-13 11:17:37,886][__main__][INFO] - agents played in iteration 495 are Bob_buffer, Alice_buffer, Alice, Bob [2025-11-13 11:17:38,339][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:17:38,379][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:17:38,420][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:17:38,459][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:17:38,460][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:17:38,460][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:17:39,147][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:17:39,445][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:17:39,776][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:17:40,102][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:17:40,434][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:17:40,759][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:17:41,088][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:17:41,415][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:17:41,747][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:17:42,075][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:17:42,406][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:17:42,734][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:17:43,065][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:17:43,400][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:17:43,732][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:17:44,061][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:17:44,388][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:17:44,716][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:17:45,044][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:17:45,370][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:17:45,697][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 11:17:46,024][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:17:46,351][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:17:46,677][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:17:47,006][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:17:47,333][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:17:47,662][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:17:47,991][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:17:48,321][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:17:48,652][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:17:48,980][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:17:49,306][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:17:49,636][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 11:17:50,290][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:17:51,016][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:17:51,018][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:17:51,019][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:17:51,891][__main__][INFO] - Iteration 496 took 23s (40.34% Gen, 55.94% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 22m 54s. Estimated total time: 19h 33m 56s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 7s, 500 more iterations: 3h 15m 39s. [2025-11-13 11:17:51,893][__main__][INFO] - Starting iteration 496. [2025-11-13 11:17:51,896][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 49 and human policies 1. 
[2025-11-13 11:17:51,897][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:18:01,027][__main__][INFO] - Number of regex retries in iteration 496: 0 [2025-11-13 11:18:01,028][__main__][INFO] - agents played in iteration 496 are Bob_buffer, Alice_buffer, Alice, Bob [2025-11-13 11:18:01,462][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:18:01,502][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:18:01,542][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:18:01,581][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:18:01,582][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:18:01,582][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:18:02,282][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:18:02,579][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:18:02,910][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:18:03,238][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:18:03,567][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:18:03,895][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:18:04,224][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:18:04,558][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:18:04,889][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:18:05,219][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:18:05,551][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:18:05,877][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:18:06,205][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:18:06,534][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:18:06,860][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:18:07,188][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:18:07,516][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:18:07,844][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:18:08,173][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:18:08,506][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:18:08,835][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 11:18:09,163][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:18:09,494][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:18:09,821][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:18:10,150][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:18:10,477][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:18:10,816][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:18:11,143][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:18:11,469][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:18:11,796][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:18:12,122][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:18:12,451][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:18:12,778][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 11:18:13,436][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:18:14,148][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:18:14,150][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:18:14,151][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:18:15,042][__main__][INFO] - Iteration 497 took 23s (39.45% Gen, 56.69% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 5m 55s. Estimated total time: 19h 17m 20s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 34s, 500 more iterations: 3h 12m 53s. [2025-11-13 11:18:15,044][__main__][INFO] - Starting iteration 497. [2025-11-13 11:18:15,048][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 49 and human policies 1. 
[2025-11-13 11:18:15,049][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:18:24,298][__main__][INFO] - Number of regex retries in iteration 497: 0 [2025-11-13 11:18:24,299][__main__][INFO] - agents played in iteration 497 are Bob_buffer, Alice_buffer, Alice, Bob [2025-11-13 11:18:24,725][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:18:24,765][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:18:24,805][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:18:24,845][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:18:24,846][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:18:24,846][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:18:25,566][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:18:25,867][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:18:26,198][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:18:26,533][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:18:26,864][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:18:27,190][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:18:27,519][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:18:27,849][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:18:28,178][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:18:28,506][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:18:28,834][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:18:29,165][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:18:29,493][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:18:29,820][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:18:30,150][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:18:30,475][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:18:30,804][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:18:31,134][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:18:31,461][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:18:31,789][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:18:32,121][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 11:18:32,451][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:18:32,781][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:18:33,109][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:18:33,436][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:18:33,764][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:18:34,090][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:18:34,430][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:18:34,757][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:18:35,082][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:18:35,409][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:18:35,740][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:18:36,067][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 11:18:36,763][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:18:37,471][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:18:37,472][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:18:37,474][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:18:38,371][__main__][INFO] - Iteration 498 took 23s (39.66% Gen, 56.49% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 14m 25s. Estimated total time: 19h 26m 13s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 52s, 500 more iterations: 3h 14m 22s. [2025-11-13 11:18:38,373][__main__][INFO] - Starting iteration 498. [2025-11-13 11:18:38,377][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 49 and human policies 1. 
[2025-11-13 11:18:38,377][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:18:47,338][__main__][INFO] - Number of regex retries in iteration 498: 0 [2025-11-13 11:18:47,339][__main__][INFO] - agents played in iteration 498 are Bob_buffer, Alice_buffer, Alice, Bob [2025-11-13 11:18:47,786][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:18:47,827][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:18:47,868][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:18:47,909][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:18:47,910][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:18:47,910][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:18:48,652][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:18:48,952][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:18:49,280][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:18:49,607][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:18:49,939][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:18:50,267][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:18:50,597][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:18:50,927][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:18:51,257][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:18:51,585][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:18:51,914][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:18:52,242][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:18:52,573][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:18:52,900][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:18:53,228][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:18:53,556][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:18:53,886][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:18:54,226][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:18:54,557][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:18:54,884][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:18:55,212][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:18:55,543][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:18:55,872][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:18:56,202][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:18:56,536][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:18:56,866][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:18:57,192][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:18:57,521][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:18:57,852][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:18:58,178][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:18:58,508][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:18:58,836][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:18:59,170][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:18:59,849][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:19:00,573][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:19:00,575][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:19:00,577][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:19:01,452][__main__][INFO] - Iteration 499 took 23s (38.83% Gen, 57.37% Train). Generation: 8s, Training: 13s. Estimated remaining time: 19h 1m 37s. Estimated total time: 19h 13m 48s. Time estimates for 10 more iterations: 3m 50s, 100 more iterations: 38m 27s, 500 more iterations: 3h 12m 18s.
[2025-11-13 11:19:01,454][__main__][INFO] - Starting iteration 499.
[2025-11-13 11:19:01,458][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 49 and human policies 1.
[2025-11-13 11:19:01,458][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:19:10,006][__main__][INFO] - Number of regex retries in iteration 499: 0
[2025-11-13 11:19:10,007][__main__][INFO] - agents played in iteration 499 are Bob_buffer, Alice_buffer, Alice, Bob
[2025-11-13 11:19:10,449][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:19:10,489][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:19:10,530][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:19:10,570][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:19:10,571][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:19:10,571][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:19:11,313][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:19:11,613][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:19:11,943][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:19:12,279][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:19:12,611][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:19:12,938][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:19:13,268][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:19:13,596][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:19:13,926][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:19:14,254][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:19:14,585][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:19:14,914][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:19:15,244][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:19:15,571][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:19:15,898][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:19:16,229][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:19:16,559][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:19:16,887][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:19:17,217][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:19:17,549][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:19:17,876][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:19:18,205][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:19:18,531][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:19:18,857][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:19:19,183][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:19:19,510][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:19:19,840][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:19:20,167][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:19:20,500][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:19:20,835][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:19:21,166][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:19:21,495][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:19:21,830][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:19:22,517][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:19:23,251][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:19:23,253][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:19:23,255][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:19:24,206][__main__][INFO] - Iteration 500 took 22s (37.58% Gen, 58.23% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 44m 54s. Estimated total time: 18h 57m 29s. Time estimates for 10 more iterations: 3m 47s, 100 more iterations: 37m 54s, 500 more iterations: 3h 9m 34s.
[2025-11-13 11:19:24,208][__main__][INFO] - Starting iteration 500.
[2025-11-13 11:19:24,212][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 49 and human policies 1.
[2025-11-13 11:19:24,212][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:19:32,855][__main__][INFO] - Number of regex retries in iteration 500: 0
[2025-11-13 11:19:32,855][__main__][INFO] - agents played in iteration 500 are Bob_buffer, Alice_buffer, Alice, Bob
[2025-11-13 11:19:33,330][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:19:33,371][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:19:33,412][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:19:33,453][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:19:33,454][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:19:33,454][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:19:34,181][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:19:34,480][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:19:34,810][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:19:35,141][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:19:35,469][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:19:35,799][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:19:36,142][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:19:36,470][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:19:36,799][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:19:37,127][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:19:37,460][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:19:37,787][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:19:38,117][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:19:38,450][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:19:38,774][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:19:39,101][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:19:39,430][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:19:39,764][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:19:40,089][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:19:40,416][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:19:40,743][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:19:41,071][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:19:41,400][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:19:41,728][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:19:42,055][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:19:42,385][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:19:42,721][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:19:43,050][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:19:43,379][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:19:43,707][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:19:44,043][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:19:44,376][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:19:44,706][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:19:45,406][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:19:46,122][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:19:46,123][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:19:46,125][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:19:47,977][__main__][INFO] - Iteration 501 took 23s (36.36% Gen, 55.84% Train). Generation: 8s, Training: 13s. Estimated remaining time: 19h 35m 20s. Estimated total time: 19h 48m 18s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 36s, 500 more iterations: 3h 18m 3s.
[2025-11-13 11:19:47,981][__main__][INFO] - Starting iteration 501.
[2025-11-13 11:19:47,984][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 50 and human policies 1.
[2025-11-13 11:19:47,985][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:19:56,210][__main__][INFO] - Number of regex retries in iteration 501: 0
[2025-11-13 11:19:56,210][__main__][INFO] - agents played in iteration 501 are Bob_buffer, Alice_buffer, Alice, Bob
[2025-11-13 11:19:56,660][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:19:56,701][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:19:56,742][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:19:56,790][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:19:56,791][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:19:56,791][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:19:57,531][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:19:57,831][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:19:58,164][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:19:58,491][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:19:58,819][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:19:59,147][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:19:59,475][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:19:59,804][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:20:00,133][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:20:00,462][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:20:00,790][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:20:01,131][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:20:01,460][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:20:01,788][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:20:02,116][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:20:02,451][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:20:02,781][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:20:03,109][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:20:03,437][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:20:03,767][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:20:04,094][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:20:04,426][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:20:04,757][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:20:05,079][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:20:05,408][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:20:05,735][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:20:06,063][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:20:06,401][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:20:06,726][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:20:07,053][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:20:07,381][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:20:07,724][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:20:08,052][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:20:08,742][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:20:09,474][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:20:09,476][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:20:09,477][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:20:10,345][__main__][INFO] - Iteration 502 took 22s (36.78% Gen, 59.33% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 24m 44s. Estimated total time: 18h 38m 4s. Time estimates for 10 more iterations: 3m 43s, 100 more iterations: 37m 16s, 500 more iterations: 3h 6m 20s.
[2025-11-13 11:20:10,347][__main__][INFO] - Starting iteration 502.
[2025-11-13 11:20:10,351][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 50 and human policies 1.
[2025-11-13 11:20:10,351][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:20:19,291][__main__][INFO] - Number of regex retries in iteration 502: 0
[2025-11-13 11:20:19,292][__main__][INFO] - agents played in iteration 502 are Bob_buffer, Alice_buffer, Alice, Bob
[2025-11-13 11:20:19,741][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:20:19,782][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:20:19,823][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:20:19,863][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:20:19,864][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:20:19,864][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:20:20,610][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:20:20,913][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:20:21,242][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:20:21,570][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:20:21,899][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:20:22,227][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:20:22,554][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:20:22,882][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:20:23,212][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:20:23,551][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:20:23,880][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:20:24,209][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:20:24,537][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:20:24,868][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:20:25,197][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:20:25,527][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:20:25,854][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:20:26,184][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:20:26,514][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:20:26,841][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:20:27,170][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:20:27,500][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:20:27,829][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:20:28,159][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:20:28,486][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:20:28,812][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:20:29,143][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:20:29,469][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:20:29,796][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:20:30,135][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:20:30,460][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:20:30,793][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:20:31,119][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:20:31,810][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:20:32,519][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:20:32,521][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:20:32,522][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:20:33,425][__main__][INFO] - Iteration 503 took 23s (38.75% Gen, 57.33% Train). Generation: 8s, Training: 13s. Estimated remaining time: 19h 0m 3s. Estimated total time: 19h 13m 47s. Time estimates for 10 more iterations: 3m 50s, 100 more iterations: 38m 27s, 500 more iterations: 3h 12m 17s.
[2025-11-13 11:20:33,427][__main__][INFO] - Starting iteration 503.
[2025-11-13 11:20:33,431][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 50 and human policies 1.
[2025-11-13 11:20:33,432][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:20:42,328][__main__][INFO] - Number of regex retries in iteration 503: 0
[2025-11-13 11:20:42,329][__main__][INFO] - agents played in iteration 503 are Bob_buffer, Alice_buffer, Alice, Bob
[2025-11-13 11:20:42,758][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:20:42,799][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:20:42,839][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:20:42,880][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:20:42,881][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:20:42,882][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:20:43,627][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:20:43,927][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:20:44,255][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:20:44,581][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:20:44,910][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:20:45,238][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:20:45,567][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:20:45,893][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:20:46,220][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:20:46,548][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:20:46,877][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:20:47,212][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:20:47,542][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:20:47,870][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:20:48,200][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:20:48,538][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:20:48,867][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:20:49,195][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:20:49,523][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:20:49,853][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:20:50,181][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:20:50,511][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:20:50,842][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:20:51,172][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:20:51,501][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:20:51,830][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:20:52,160][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:20:52,494][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:20:52,820][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:20:53,148][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:20:53,476][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:20:53,804][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:20:54,134][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:20:54,823][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:20:55,561][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:20:55,564][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:20:55,565][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:20:56,438][__main__][INFO] - Iteration 504 took 23s (38.67% Gen, 57.53% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 56m 16s. Estimated total time: 19h 10m 23s. Time estimates for 10 more iterations: 3m 50s, 100 more iterations: 38m 20s, 500 more iterations: 3h 11m 43s.
[2025-11-13 11:20:56,440][__main__][INFO] - Starting iteration 504.
[2025-11-13 11:20:56,443][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 50 and human policies 1.
[2025-11-13 11:20:56,444][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:21:05,575][__main__][INFO] - Number of regex retries in iteration 504: 0
[2025-11-13 11:21:05,575][__main__][INFO] - agents played in iteration 504 are Bob_buffer, Alice_buffer, Alice, Bob
[2025-11-13 11:21:06,040][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:21:06,080][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:21:06,122][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:21:06,163][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:21:06,163][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:21:06,164][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:21:06,893][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:21:07,196][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:21:07,527][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:21:07,855][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:21:08,183][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:21:08,511][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:21:08,838][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:21:09,166][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:21:09,495][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:21:09,822][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:21:10,149][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:21:10,476][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:21:10,804][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:21:11,143][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:21:11,470][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:21:11,798][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:21:12,126][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:21:12,454][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:21:12,782][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:21:13,111][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:21:13,442][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:21:13,775][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:21:14,099][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:21:14,426][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:21:14,755][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:21:15,081][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:21:15,414][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:21:15,739][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:21:16,067][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:21:16,394][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:21:16,722][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:21:17,051][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:21:17,384][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:21:18,069][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:21:18,778][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:21:18,780][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:21:18,781][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:21:19,684][__main__][INFO] - Iteration 505 took 23s (39.29% Gen, 56.82% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 7m 35s. Estimated total time: 19h 22m 5s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 44s, 500 more iterations: 3h 13m 40s.
[2025-11-13 11:21:19,686][__main__][INFO] - Starting iteration 505.
[2025-11-13 11:21:19,689][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 50 and human policies 1.
[2025-11-13 11:21:19,690][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:21:28,194][__main__][INFO] - Number of regex retries in iteration 505: 0
[2025-11-13 11:21:28,194][__main__][INFO] - agents played in iteration 505 are Bob_buffer, Alice_buffer, Alice, Bob
[2025-11-13 11:21:28,632][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:21:28,672][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:21:28,713][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:21:28,754][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:21:28,754][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:21:28,755][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:21:29,487][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:21:29,785][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:21:30,113][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:21:30,442][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:21:30,772][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:21:31,097][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:21:31,424][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:21:31,752][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:21:32,083][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:21:32,406][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:21:32,735][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:21:33,063][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:21:33,390][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:21:33,718][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:21:34,046][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:21:34,374][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:21:34,702][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:21:35,030][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:21:35,361][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:21:35,691][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:21:36,019][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:21:36,347][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:21:36,676][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:21:37,004][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:21:37,333][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:21:37,664][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:21:37,998][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:21:38,331][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:21:38,663][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:21:38,991][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:21:39,322][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:21:39,651][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:21:39,988][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:21:40,676][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:21:41,412][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:21:41,413][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:21:41,415][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:21:42,358][__main__][INFO] - Iteration 506 took 22s (37.51% Gen, 58.32% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 38m 37s. Estimated total time: 18h 53m 29s. Time estimates for 10 more iterations: 3m 46s, 100 more iterations: 37m 46s, 500 more iterations: 3h 8m 54s.
[2025-11-13 11:21:42,360][__main__][INFO] - Starting iteration 506.
[2025-11-13 11:21:42,363][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 50 and human policies 1.
[2025-11-13 11:21:42,364][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:21:50,958][__main__][INFO] - Number of regex retries in iteration 506: 0
[2025-11-13 11:21:50,958][__main__][INFO] - agents played in iteration 506 are Bob_buffer, Alice_buffer, Alice, Bob
[2025-11-13 11:21:51,393][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:21:51,440][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:21:51,487][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:21:51,528][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:21:51,529][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:21:51,529][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:21:52,261][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:21:52,562][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:21:52,891][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:21:53,221][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:21:53,562][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:21:53,889][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:21:54,216][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:21:54,544][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:21:54,871][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:21:55,197][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:21:55,525][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:21:55,852][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:21:56,182][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:21:56,510][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:21:56,838][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:21:57,165][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:21:57,493][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:21:57,822][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:21:58,151][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:21:58,479][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:21:58,808][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:21:59,134][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:21:59,462][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:21:59,788][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:22:00,117][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:22:00,448][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:22:00,775][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:22:01,102][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:22:01,429][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:22:01,757][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:22:02,087][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:22:02,411][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:22:02,738][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:22:03,416][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:22:04,142][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:22:04,144][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:22:04,146][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:22:05,016][__main__][INFO] - Iteration 507 took 22s (37.94% Gen, 58.21% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 37m 25s. Estimated total time: 18h 52m 40s. Time estimates for 10 more iterations: 3m 46s, 100 more iterations: 37m 45s, 500 more iterations: 3h 8m 46s.
[2025-11-13 11:22:05,018][__main__][INFO] - Starting iteration 507.
[2025-11-13 11:22:05,022][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 50 and human policies 1.
[2025-11-13 11:22:05,022][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:22:13,575][__main__][INFO] - Number of regex retries in iteration 507: 0
[2025-11-13 11:22:13,575][__main__][INFO] - agents played in iteration 507 are Bob_buffer, Alice_buffer, Alice, Bob
[2025-11-13 11:22:14,020][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:22:14,060][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:22:14,101][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:22:14,142][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:22:14,143][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:22:14,143][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:22:14,878][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:22:15,179][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:22:15,507][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:22:15,833][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:22:16,162][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:22:16,492][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:22:16,819][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:22:17,146][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:22:17,474][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:22:17,801][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:22:18,132][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:22:18,461][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:22:18,791][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:22:19,119][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:22:19,456][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:22:19,783][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:22:20,112][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:22:20,440][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:22:20,771][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:22:21,100][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:22:21,429][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:22:21,756][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:22:22,085][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:22:22,414][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:22:22,742][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:22:23,067][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:22:23,393][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:22:23,725][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:22:24,053][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:22:24,384][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:22:24,712][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:22:25,040][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:22:25,367][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:22:26,085][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:22:26,799][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:22:26,800][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:22:26,802][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:22:27,704][__main__][INFO] - Iteration 508 took 22s (37.71% Gen, 58.31% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 38m 32s. Estimated total time: 18h 54m 10s. Time estimates for 10 more iterations: 3m 46s, 100 more iterations: 37m 48s, 500 more iterations: 3h 9m 1s.
[2025-11-13 11:22:27,706][__main__][INFO] - Starting iteration 508.
[2025-11-13 11:22:27,709][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 50 and human policies 1.
[2025-11-13 11:22:27,710][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:22:36,730][__main__][INFO] - Number of regex retries in iteration 508: 0
[2025-11-13 11:22:36,730][__main__][INFO] - agents played in iteration 508 are Bob_buffer, Alice_buffer, Alice, Bob
[2025-11-13 11:22:37,168][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:22:37,208][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:22:37,249][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:22:37,290][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:22:37,291][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:22:37,291][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:22:38,049][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:22:38,350][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:22:38,681][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:22:39,009][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:22:39,336][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:22:39,664][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:22:39,992][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:22:40,321][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:22:40,649][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:22:40,977][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:22:41,305][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:22:41,632][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:22:41,971][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:22:42,299][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:22:42,627][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:22:42,954][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:22:43,281][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:22:43,608][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:22:43,937][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:22:44,270][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:22:44,594][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:22:44,922][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:22:45,249][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:22:45,578][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:22:45,907][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:22:46,236][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:22:46,564][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:22:46,896][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:22:47,223][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:22:47,551][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:22:47,879][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:22:48,206][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:22:48,535][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:22:49,219][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:22:49,954][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:22:49,956][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:22:49,957][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:22:50,923][__main__][INFO] - Iteration 509 took 23s (38.86% Gen, 56.97% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 4m 43s. Estimated total time: 19h 20m 44s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 41s, 500 more iterations: 3h 13m 27s.
[2025-11-13 11:22:50,925][__main__][INFO] - Starting iteration 509.
[2025-11-13 11:22:50,929][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 50 and human policies 1.
[2025-11-13 11:22:50,930][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:22:59,892][__main__][INFO] - Number of regex retries in iteration 509: 0
[2025-11-13 11:22:59,893][__main__][INFO] - agents played in iteration 509 are Bob_buffer, Alice_buffer, Alice, Bob
[2025-11-13 11:23:00,328][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:23:00,368][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:23:00,409][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:23:00,450][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:23:00,451][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:23:00,451][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:23:01,169][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:23:01,472][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:23:01,801][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:23:02,128][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:23:02,457][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:23:02,791][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:23:03,123][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:23:03,452][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:23:03,780][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:23:04,110][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:23:04,439][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:23:04,765][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:23:05,092][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:23:05,419][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:23:05,747][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:23:06,076][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:23:06,404][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:23:06,734][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:23:07,061][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:23:07,389][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:23:07,718][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:23:08,046][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:23:08,374][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:23:08,701][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:23:09,031][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:23:09,359][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:23:09,690][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:23:10,023][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:23:10,349][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:23:10,681][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:23:11,016][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:23:11,342][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:23:11,672][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:23:12,388][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:23:13,107][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:23:13,109][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:23:13,111][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:23:13,975][__main__][INFO] - Iteration 510 took 23s (38.89% Gen, 57.35% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 55m 55s. Estimated total time: 19h 12m 20s. Time estimates for 10 more iterations: 3m 50s, 100 more iterations: 38m 24s, 500 more iterations: 3h 12m 3s. [2025-11-13 11:23:13,977][__main__][INFO] - Starting iteration 510. [2025-11-13 11:23:13,980][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 50 and human policies 1. 
[2025-11-13 11:23:13,981][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:23:22,720][__main__][INFO] - Number of regex retries in iteration 510: 0 [2025-11-13 11:23:22,721][__main__][INFO] - agents played in iteration 510 are Bob_buffer, Alice_buffer, Alice, Bob [2025-11-13 11:23:23,154][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:23:23,194][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:23:23,236][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:23:23,278][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:23:23,278][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:23:23,279][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:23:24,014][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:23:24,314][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:23:24,643][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:23:24,971][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:23:25,300][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:23:25,629][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:23:25,958][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:23:26,290][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:23:26,613][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:23:26,941][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:23:27,271][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:23:27,604][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:23:27,926][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:23:28,253][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:23:28,579][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:23:28,906][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:23:29,233][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:23:29,561][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:23:29,888][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:23:30,217][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 
[2025-11-13 11:23:30,544][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:23:30,872][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:23:31,200][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:23:31,534][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:23:31,864][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:23:32,193][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:23:32,521][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:23:32,849][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:23:33,175][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:23:33,504][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:23:33,834][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:23:34,160][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:23:34,491][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 11:23:35,184][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:23:35,916][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:23:35,917][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:23:35,919][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:23:37,684][__main__][INFO] - Iteration 511 took 23s (36.87% Gen, 55.68% Train). Generation: 8s, Training: 13s. Estimated remaining time: 19h 28m 25s. Estimated total time: 19h 45m 13s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 30s, 500 more iterations: 3h 17m 32s. [2025-11-13 11:23:37,686][__main__][INFO] - Starting iteration 511. [2025-11-13 11:23:37,689][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 51 and human policies 1. 
[2025-11-13 11:23:37,690][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:23:46,275][__main__][INFO] - Number of regex retries in iteration 511: 0 [2025-11-13 11:23:46,276][__main__][INFO] - agents played in iteration 511 are Bob_buffer, Alice_buffer, Alice, Bob [2025-11-13 11:23:46,714][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:23:46,754][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:23:46,794][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:23:46,835][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:23:46,836][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:23:46,836][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:23:47,579][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:23:47,878][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:23:48,208][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:23:48,540][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:23:48,876][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:23:49,204][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:23:49,533][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:23:49,862][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:23:50,191][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:23:50,517][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:23:50,844][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:23:51,173][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:23:51,500][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:23:51,826][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:23:52,153][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:23:52,480][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:23:52,807][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:23:53,135][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:23:53,462][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:23:53,789][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 
[2025-11-13 11:23:54,120][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:23:54,447][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:23:54,775][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:23:55,109][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:23:55,436][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:23:55,763][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:23:56,092][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:23:56,422][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:23:56,750][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:23:57,076][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:23:57,405][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:23:57,739][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:23:58,067][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 11:23:58,753][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:23:59,484][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:23:59,485][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:23:59,488][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:24:00,388][__main__][INFO] - Iteration 512 took 22s (37.83% Gen, 58.20% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 37m 49s. Estimated total time: 18h 55m 0s. Time estimates for 10 more iterations: 3m 47s, 100 more iterations: 37m 50s, 500 more iterations: 3h 9m 10s. [2025-11-13 11:24:00,390][__main__][INFO] - Starting iteration 512. [2025-11-13 11:24:00,393][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 51 and human policies 1. 
[2025-11-13 11:24:00,394][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:24:09,551][__main__][INFO] - Number of regex retries in iteration 512: 0 [2025-11-13 11:24:09,552][__main__][INFO] - agents played in iteration 512 are Bob_buffer, Alice_buffer, Alice, Bob [2025-11-13 11:24:09,989][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:24:10,031][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:24:10,072][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:24:10,112][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:24:10,113][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:24:10,113][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:24:10,853][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:24:11,152][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:24:11,491][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:24:11,821][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:24:12,149][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:24:12,476][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:24:12,807][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:24:13,136][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:24:13,465][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:24:13,797][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:24:14,121][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:24:14,449][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:24:14,776][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:24:15,109][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:24:15,433][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:24:15,760][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:24:16,088][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:24:16,417][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:24:16,753][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:24:17,081][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 
[2025-11-13 11:24:17,411][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:24:17,738][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:24:18,071][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:24:18,399][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:24:18,728][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:24:19,059][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:24:19,399][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:24:19,733][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:24:20,060][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:24:20,388][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:24:20,717][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:24:21,045][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:24:21,373][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 11:24:22,079][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:24:22,790][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:24:22,792][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:24:22,794][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:24:23,688][__main__][INFO] - Iteration 513 took 23s (39.31% Gen, 56.84% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 7m 14s. Estimated total time: 19h 24m 48s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 49s, 500 more iterations: 3h 14m 8s. [2025-11-13 11:24:23,691][__main__][INFO] - Starting iteration 513. [2025-11-13 11:24:23,694][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 51 and human policies 1. 
[2025-11-13 11:24:23,695][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:24:32,397][__main__][INFO] - Number of regex retries in iteration 513: 0 [2025-11-13 11:24:32,398][__main__][INFO] - agents played in iteration 513 are Bob_buffer, Alice_buffer, Alice, Bob [2025-11-13 11:24:32,829][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:24:32,869][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:24:32,909][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:24:32,950][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:24:32,950][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:24:32,951][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:24:33,678][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:24:33,984][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:24:34,313][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:24:34,642][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:24:34,972][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:24:35,310][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:24:35,640][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:24:35,969][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:24:36,298][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:24:36,632][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:24:36,962][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:24:37,290][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:24:37,625][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:24:37,945][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:24:38,273][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:24:38,603][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:24:38,931][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:24:39,261][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:24:39,590][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:24:39,920][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 
[2025-11-13 11:24:40,250][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:24:40,579][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:24:40,906][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:24:41,233][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:24:41,560][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:24:41,888][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:24:42,216][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:24:42,545][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:24:42,872][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:24:43,208][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:24:43,535][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:24:43,864][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:24:44,190][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 11:24:44,880][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:24:45,610][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:24:45,614][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:24:45,617][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:24:46,560][__main__][INFO] - Iteration 514 took 22s (38.06% Gen, 57.81% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 45m 23s. Estimated total time: 19h 3m 20s. Time estimates for 10 more iterations: 3m 48s, 100 more iterations: 38m 6s, 500 more iterations: 3h 10m 33s. [2025-11-13 11:24:46,562][__main__][INFO] - Starting iteration 514. [2025-11-13 11:24:46,566][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 51 and human policies 1. 
[2025-11-13 11:24:46,566][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:24:54,547][__main__][INFO] - Number of regex retries in iteration 514: 0 [2025-11-13 11:24:54,547][__main__][INFO] - agents played in iteration 514 are Bob_buffer, Alice_buffer, Alice, Bob [2025-11-13 11:24:54,998][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:24:55,038][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:24:55,078][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:24:55,120][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:24:55,120][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:24:55,120][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:24:55,844][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:24:56,143][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:24:56,472][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:24:56,801][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:24:57,132][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:24:57,461][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:24:57,795][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:24:58,123][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:24:58,454][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:24:58,787][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:24:59,116][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:24:59,451][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:24:59,779][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:25:00,108][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:25:00,439][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:25:00,767][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:25:01,094][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:25:01,423][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:25:01,752][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:25:02,082][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 
[2025-11-13 11:25:02,410][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:25:02,739][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:25:03,067][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:25:03,395][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:25:03,722][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:25:04,050][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:25:04,381][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:25:04,709][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:25:05,037][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:25:05,367][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:25:05,697][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:25:06,025][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:25:06,353][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 11:25:07,047][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:25:07,761][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:25:07,763][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:25:07,764][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:25:08,890][__main__][INFO] - Iteration 515 took 22s (35.75% Gen, 59.20% Train). Generation: 7s, Training: 13s. Estimated remaining time: 18h 17m 57s. Estimated total time: 18h 36m 16s. Time estimates for 10 more iterations: 3m 43s, 100 more iterations: 37m 12s, 500 more iterations: 3h 6m 2s. [2025-11-13 11:25:08,892][__main__][INFO] - Starting iteration 515. [2025-11-13 11:25:08,896][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 51 and human policies 1. 
[2025-11-13 11:25:08,897][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:25:17,475][__main__][INFO] - Number of regex retries in iteration 515: 0 [2025-11-13 11:25:17,475][__main__][INFO] - agents played in iteration 515 are Bob_buffer, Alice_buffer, Alice, Bob [2025-11-13 11:25:17,905][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:25:17,945][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:25:17,986][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:25:18,027][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:25:18,027][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:25:18,028][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:25:18,741][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:25:19,039][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:25:19,367][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:25:19,694][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:25:20,022][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:25:20,357][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:25:20,684][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:25:21,010][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:25:21,341][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:25:21,671][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:25:22,000][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:25:22,332][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:25:22,661][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:25:22,988][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:25:23,315][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:25:23,643][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:25:23,970][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:25:24,298][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:25:24,625][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:25:24,954][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:25:25,282][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:25:25,610][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:25:25,938][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:25:26,265][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:25:26,593][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:25:26,921][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:25:27,248][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:25:27,575][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:25:27,904][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:25:28,242][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:25:28,570][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:25:28,899][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:25:29,232][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:25:29,918][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:25:30,645][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:25:30,647][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:25:30,649][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:25:31,861][__main__][INFO] - Iteration 516 took 22s (37.35% Gen, 57.36% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 49m 35s. Estimated total time: 19h 8m 18s. Time estimates for 10 more iterations: 3m 49s, 100 more iterations: 38m 16s, 500 more iterations: 3h 11m 23s.
[2025-11-13 11:25:31,863][__main__][INFO] - Starting iteration 516.
[2025-11-13 11:25:31,867][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 51 and human policies 1.
[2025-11-13 11:25:31,868][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:25:40,601][__main__][INFO] - Number of regex retries in iteration 516: 0
[2025-11-13 11:25:40,602][__main__][INFO] - agents played in iteration 516 are Bob_buffer, Alice_buffer, Alice, Bob
[2025-11-13 11:25:41,046][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:25:41,087][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:25:41,128][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:25:41,170][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:25:41,170][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:25:41,170][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:25:41,893][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:25:42,192][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:25:42,524][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:25:42,853][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:25:43,184][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:25:43,515][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:25:43,844][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:25:44,171][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:25:44,498][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:25:44,826][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:25:45,160][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:25:45,496][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:25:45,831][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:25:46,164][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:25:46,494][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:25:46,822][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:25:47,149][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:25:47,477][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:25:47,804][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:25:48,132][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:25:48,459][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:25:48,786][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:25:49,114][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:25:49,441][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:25:49,769][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:25:50,096][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:25:50,424][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:25:50,751][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:25:51,079][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:25:51,407][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:25:51,734][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:25:52,062][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:25:52,390][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:25:53,090][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:25:53,802][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:25:53,804][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:25:53,806][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:25:54,705][__main__][INFO] - Iteration 517 took 22s (38.24% Gen, 57.81% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 42m 51s. Estimated total time: 19h 1m 56s. Time estimates for 10 more iterations: 3m 48s, 100 more iterations: 38m 3s, 500 more iterations: 3h 10m 19s.
[2025-11-13 11:25:54,707][__main__][INFO] - Starting iteration 517.
[2025-11-13 11:25:54,710][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 51 and human policies 1.
[2025-11-13 11:25:54,711][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:26:02,937][__main__][INFO] - Number of regex retries in iteration 517: 0
[2025-11-13 11:26:02,938][__main__][INFO] - agents played in iteration 517 are Bob_buffer, Alice_buffer, Alice, Bob
[2025-11-13 11:26:03,369][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:26:03,409][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:26:03,449][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:26:03,490][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:26:03,491][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:26:03,492][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:26:04,232][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:26:04,531][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:26:04,859][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:26:05,185][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:26:05,513][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:26:05,841][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:26:06,169][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:26:06,496][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:26:06,822][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:26:07,153][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:26:07,483][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:26:07,810][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:26:08,144][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:26:08,472][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:26:08,800][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:26:09,129][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:26:09,459][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:26:09,787][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:26:10,117][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:26:10,452][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:26:10,779][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:26:11,108][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:26:11,437][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:26:11,764][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:26:12,091][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:26:12,419][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:26:12,748][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:26:13,075][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:26:13,403][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:26:13,731][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:26:14,059][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:26:14,387][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:26:14,714][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:26:15,406][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:26:16,136][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:26:16,137][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:26:16,139][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:26:17,076][__main__][INFO] - Iteration 518 took 22s (36.78% Gen, 59.02% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 18m 53s. Estimated total time: 18h 38m 20s. Time estimates for 10 more iterations: 3m 43s, 100 more iterations: 37m 16s, 500 more iterations: 3h 6m 23s.
[2025-11-13 11:26:17,081][__main__][INFO] - Starting iteration 518.
[2025-11-13 11:26:17,084][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 51 and human policies 1.
[2025-11-13 11:26:17,085][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:26:25,879][__main__][INFO] - Number of regex retries in iteration 518: 0
[2025-11-13 11:26:25,879][__main__][INFO] - agents played in iteration 518 are Bob_buffer, Alice_buffer, Alice, Bob
[2025-11-13 11:26:26,318][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:26:26,358][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:26:26,399][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:26:26,439][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:26:26,439][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:26:26,439][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:26:27,147][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:26:27,446][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:26:27,774][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:26:28,102][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:26:28,431][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:26:28,759][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:26:29,088][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:26:29,415][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:26:29,749][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:26:30,080][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:26:30,410][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:26:30,740][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:26:31,068][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:26:31,400][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:26:31,729][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:26:32,059][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:26:32,387][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:26:32,716][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:26:33,044][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:26:33,375][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:26:33,703][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:26:34,032][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:26:34,360][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:26:34,686][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:26:35,013][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:26:35,341][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:26:35,667][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:26:35,996][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:26:36,323][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:26:36,651][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:26:36,979][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:26:37,306][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:26:37,635][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:26:38,334][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:26:39,040][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:26:39,043][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:26:39,046][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:26:39,950][__main__][INFO] - Iteration 519 took 22s (38.46% Gen, 57.58% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 43m 30s. Estimated total time: 19h 3m 20s. Time estimates for 10 more iterations: 3m 48s, 100 more iterations: 38m 6s, 500 more iterations: 3h 10m 33s.
[2025-11-13 11:26:39,956][__main__][INFO] - Starting iteration 519.
[2025-11-13 11:26:39,959][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 51 and human policies 1.
[2025-11-13 11:26:39,959][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:26:48,278][__main__][INFO] - Number of regex retries in iteration 519: 0
[2025-11-13 11:26:48,278][__main__][INFO] - agents played in iteration 519 are Bob_buffer, Alice_buffer, Alice, Bob
[2025-11-13 11:26:48,728][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:26:48,768][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:26:48,808][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:26:48,848][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:26:48,849][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:26:48,849][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:26:49,562][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:26:49,865][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:26:50,192][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:26:50,522][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:26:50,852][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:26:51,180][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:26:51,509][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:26:51,840][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:26:52,170][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:26:52,496][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:26:52,824][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:26:53,151][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:26:53,477][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:26:53,804][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:26:54,140][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:26:54,469][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:26:54,797][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:26:55,130][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:26:55,456][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:26:55,785][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:26:56,116][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:26:56,454][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:26:56,780][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:26:57,107][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:26:57,434][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:26:57,764][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:26:58,092][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:26:58,420][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:26:58,748][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:26:59,076][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:26:59,410][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:26:59,736][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:27:00,063][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:27:00,769][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:27:01,456][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:27:01,458][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:27:01,460][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:27:02,333][__main__][INFO] - Iteration 520 took 22s (37.18% Gen, 58.91% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 18m 32s. Estimated total time: 18h 38m 44s. Time estimates for 10 more iterations: 3m 43s, 100 more iterations: 37m 17s, 500 more iterations: 3h 6m 27s.
[2025-11-13 11:27:02,335][__main__][INFO] - Starting iteration 520.
[2025-11-13 11:27:02,339][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 51 and human policies 1.
[2025-11-13 11:27:02,339][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:27:11,088][__main__][INFO] - Number of regex retries in iteration 520: 0
[2025-11-13 11:27:11,089][__main__][INFO] - agents played in iteration 520 are Bob_buffer, Alice_buffer, Alice, Bob
[2025-11-13 11:27:11,516][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:27:11,556][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:27:11,597][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:27:11,638][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:27:11,638][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:27:11,639][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:27:12,367][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:27:12,668][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:27:12,997][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:27:13,324][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:27:13,655][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:27:13,984][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:27:14,311][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:27:14,638][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:27:14,966][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:27:15,296][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:27:15,625][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:27:15,955][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:27:16,297][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:27:16,622][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:27:16,953][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:27:17,284][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:27:17,623][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:27:17,952][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:27:18,287][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:27:18,618][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:27:18,952][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:27:19,288][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:27:19,618][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:27:19,953][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:27:20,276][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:27:20,605][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:27:20,934][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:27:21,262][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:27:21,589][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:27:21,917][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:27:22,244][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:27:22,573][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:27:22,902][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:27:23,604][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:27:24,283][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:27:24,284][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:27:24,286][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:27:25,986][__main__][INFO] - Iteration 521 took 23s (37.00% Gen, 55.80% Train). Generation: 8s, Training: 13s. Estimated remaining time: 19h 21m 49s. Estimated total time: 19h 42m 25s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 24s, 500 more iterations: 3h 17m 4s.
[2025-11-13 11:27:25,988][__main__][INFO] - Starting iteration 521.
[2025-11-13 11:27:25,992][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 52 and human policies 1.
[2025-11-13 11:27:25,992][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:27:34,319][__main__][INFO] - Number of regex retries in iteration 521: 0
[2025-11-13 11:27:34,320][__main__][INFO] - agents played in iteration 521 are Bob_buffer, Alice_buffer, Alice, Bob
[2025-11-13 11:27:34,760][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:27:34,800][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:27:34,840][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:27:34,880][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:27:34,881][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:27:34,881][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:27:35,589][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:27:35,958][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:27:36,289][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:27:36,618][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:27:36,948][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:27:37,275][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:27:37,604][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:27:37,935][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:27:38,262][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:27:38,589][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:27:38,919][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:27:39,246][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:27:39,575][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:27:39,914][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:27:40,244][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:27:40,573][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:27:40,904][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:27:41,245][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:27:41,574][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:27:41,903][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:27:42,233][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:27:42,568][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:27:42,895][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:27:43,223][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:27:43,558][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:27:43,880][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:27:44,207][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:27:44,534][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:27:44,864][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:27:45,191][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:27:45,519][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:27:45,846][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:27:46,175][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:27:46,886][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:27:47,567][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:27:47,568][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:27:47,570][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:27:48,423][__main__][INFO] - Iteration 522 took 22s (37.12% Gen, 59.07% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 20m 38s. Estimated total time: 18h 41m 37s. Time estimates for 10 more iterations: 3m 44s, 100 more iterations: 37m 23s, 500 more iterations: 3h 6m 56s.
[2025-11-13 11:27:48,425][__main__][INFO] - Starting iteration 522.
[2025-11-13 11:27:48,428][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 52 and human policies 1.
[2025-11-13 11:27:48,429][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:27:57,628][__main__][INFO] - Number of regex retries in iteration 522: 0
[2025-11-13 11:27:57,629][__main__][INFO] - agents played in iteration 522 are Bob_buffer, Alice_buffer, Alice, Bob
[2025-11-13 11:27:58,069][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:27:58,111][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:27:58,150][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:27:58,190][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:27:58,191][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:27:58,191][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:27:58,912][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:27:59,212][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:27:59,542][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:27:59,869][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:28:00,197][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:28:00,528][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:28:00,861][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:28:01,191][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:28:01,518][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:28:01,846][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:28:02,172][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:28:02,498][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:28:02,827][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:28:03,160][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:28:03,481][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:28:03,809][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:28:04,138][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:28:04,470][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:28:04,802][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:28:05,132][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:28:05,467][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:28:05,796][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:28:06,123][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:28:06,451][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:28:06,780][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:28:07,109][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:28:07,437][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:28:07,764][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:28:08,092][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:28:08,421][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:28:08,749][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:28:09,076][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:28:09,404][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:28:10,108][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:28:10,790][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:28:10,792][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:28:10,794][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:28:11,623][__main__][INFO] - Iteration 523 took 23s (39.66% Gen, 56.76% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 58m 24s. Estimated total time: 19h 19m 46s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 39s, 500 more iterations: 3h 13m 17s.
[2025-11-13 11:28:11,625][__main__][INFO] - Starting iteration 523.
[2025-11-13 11:28:11,628][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 52 and human policies 1.
[2025-11-13 11:28:11,628][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:28:20,431][__main__][INFO] - Number of regex retries in iteration 523: 0
[2025-11-13 11:28:20,432][__main__][INFO] - agents played in iteration 523 are Bob_buffer, Alice_buffer, Alice, Bob
[2025-11-13 11:28:20,864][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:28:20,905][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:28:20,945][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:28:20,986][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:28:20,986][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:28:20,987][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:28:21,719][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:28:22,019][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:28:22,349][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:28:22,678][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:28:23,017][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:28:23,345][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:28:23,672][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:28:23,999][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:28:24,340][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:28:24,666][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:28:24,994][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:28:25,322][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:28:25,659][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:28:25,987][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:28:26,316][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:28:26,649][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:28:26,974][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:28:27,302][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:28:27,632][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:28:27,963][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:28:28,288][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:28:28,619][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:28:28,947][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:28:29,278][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:28:29,606][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:28:29,937][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:28:30,264][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:28:30,594][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:28:30,923][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:28:31,253][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:28:31,580][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:28:31,908][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:28:32,236][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:28:32,940][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:28:33,630][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:28:33,631][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:28:33,633][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:28:34,486][__main__][INFO] - Iteration 524 took 22s (38.51% Gen, 57.75% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 41m 13s. Estimated total time: 19h 2m 58s. Time estimates for 10 more iterations: 3m 48s, 100 more iterations: 38m 5s, 500 more iterations: 3h 10m 29s.
[2025-11-13 11:28:34,488][__main__][INFO] - Starting iteration 524.
[2025-11-13 11:28:34,491][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 52 and human policies 1.
[2025-11-13 11:28:34,492][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:28:43,027][__main__][INFO] - Number of regex retries in iteration 524: 0
[2025-11-13 11:28:43,028][__main__][INFO] - agents played in iteration 524 are Bob_buffer, Alice_buffer, Alice, Bob
[2025-11-13 11:28:43,458][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:28:43,499][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:28:43,539][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:28:43,585][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:28:43,585][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:28:43,586][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:28:44,298][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:28:44,598][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:28:44,929][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:28:45,267][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:28:45,597][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:28:45,925][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:28:46,257][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:28:46,582][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:28:46,910][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:28:47,245][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:28:47,581][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:28:47,911][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:28:48,238][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:28:48,568][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:28:48,894][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:28:49,223][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:28:49,552][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:28:49,884][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:28:50,216][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:28:50,555][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:28:50,889][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:28:51,224][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:28:51,553][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:28:51,886][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:28:52,213][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:28:52,541][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:28:52,874][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:28:53,196][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:28:53,524][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:28:53,850][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:28:54,183][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:28:54,506][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:28:54,835][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:28:55,533][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:28:56,225][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:28:56,229][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:28:56,231][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:28:57,204][__main__][INFO] - Iteration 525 took 22s (37.58% Gen, 58.13% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 33m 33s. Estimated total time: 18h 55m 40s. Time estimates for 10 more iterations: 3m 47s, 100 more iterations: 37m 51s, 500 more iterations: 3h 9m 16s.
[2025-11-13 11:28:57,206][__main__][INFO] - Starting iteration 525.
[2025-11-13 11:28:57,208][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 52 and human policies 1.
[2025-11-13 11:28:57,209][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:29:05,608][__main__][INFO] - Number of regex retries in iteration 525: 0
[2025-11-13 11:29:05,608][__main__][INFO] - agents played in iteration 525 are Bob_buffer, Alice_buffer, Alice, Bob
[2025-11-13 11:29:06,050][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:29:06,091][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:29:06,131][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:29:06,172][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:29:06,173][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:29:06,173][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:29:06,898][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:29:07,199][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:29:07,527][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:29:07,858][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:29:08,187][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:29:08,516][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:29:08,849][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:29:09,179][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:29:09,509][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:29:09,839][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:29:10,171][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:29:10,506][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:29:10,843][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:29:11,173][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:29:11,504][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:29:11,840][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:29:12,176][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:29:12,509][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:29:12,856][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:29:13,184][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:29:13,514][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:29:13,841][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:29:14,173][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:29:14,508][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:29:14,849][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:29:15,177][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:29:15,506][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:29:15,835][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:29:16,164][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:29:16,492][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:29:16,827][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:29:17,155][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:29:17,484][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:29:18,212][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:29:18,893][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:29:18,895][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:29:18,896][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:29:19,837][__main__][INFO] - Iteration 526 took 22s (37.12% Gen, 58.72% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 28m 59s. Estimated total time: 18h 51m 29s. Time estimates for 10 more iterations: 3m 46s, 100 more iterations: 37m 42s, 500 more iterations: 3h 8m 34s.
[2025-11-13 11:29:19,839][__main__][INFO] - Starting iteration 526.
[2025-11-13 11:29:19,842][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 52 and human policies 1.
[2025-11-13 11:29:19,843][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:29:28,811][__main__][INFO] - Number of regex retries in iteration 526: 0 [2025-11-13 11:29:28,811][__main__][INFO] - agents played in iteration 526 are Bob_buffer, Alice_buffer, Alice, Bob [2025-11-13 11:29:29,239][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:29:29,279][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:29:29,319][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:29:29,360][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:29:29,361][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:29:29,362][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:29:30,082][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:29:30,381][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:29:30,710][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:29:31,039][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:29:31,372][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:29:31,705][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:29:32,037][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:29:32,369][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:29:32,693][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:29:33,020][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:29:33,347][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:29:33,682][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:29:34,005][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:29:34,332][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:29:34,660][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:29:34,989][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:29:35,317][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:29:35,653][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:29:35,985][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:29:36,316][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:29:36,651][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:29:36,986][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:29:37,318][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:29:37,647][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:29:37,979][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:29:38,306][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:29:38,634][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:29:38,967][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:29:39,291][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:29:39,619][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:29:39,948][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:29:40,279][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:29:40,606][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:29:41,309][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:29:42,015][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:29:42,019][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:29:42,021][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:29:42,810][__main__][INFO] - Iteration 527 took 22s (39.04% Gen, 57.51% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 45m 33s. Estimated total time: 19h 8m 26s. Time estimates for 10 more iterations: 3m 49s, 100 more iterations: 38m 16s, 500 more iterations: 3h 11m 24s.
[2025-11-13 11:29:42,812][__main__][INFO] - Starting iteration 527.
[2025-11-13 11:29:42,814][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 52 and human policies 1.
[2025-11-13 11:29:42,815][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:29:51,234][__main__][INFO] - Number of regex retries in iteration 527: 0
[2025-11-13 11:29:51,235][__main__][INFO] - agents played in iteration 527 are Bob_buffer, Alice_buffer, Alice, Bob
[2025-11-13 11:29:51,676][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:29:51,716][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:29:51,756][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:29:51,797][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:29:51,798][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:29:51,798][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:29:52,515][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:29:52,814][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:29:53,143][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:29:53,472][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:29:53,802][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:29:54,131][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:29:54,459][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:29:54,787][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:29:55,114][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:29:55,442][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:29:55,771][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:29:56,101][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:29:56,430][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:29:56,757][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:29:57,083][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:29:57,413][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:29:57,745][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:29:58,073][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:29:58,402][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:29:58,731][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:29:59,059][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:29:59,388][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:29:59,716][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:30:00,045][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:30:00,375][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:30:00,703][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:30:01,032][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:30:01,367][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:30:01,692][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:30:02,020][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:30:02,348][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:30:02,684][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:30:03,007][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:30:03,726][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:30:04,437][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:30:04,438][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:30:04,440][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:30:05,276][__main__][INFO] - Iteration 528 took 22s (37.48% Gen, 58.79% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 19m 51s. Estimated total time: 18h 43m 7s. Time estimates for 10 more iterations: 3m 44s, 100 more iterations: 37m 26s, 500 more iterations: 3h 7m 11s.
[2025-11-13 11:30:05,277][__main__][INFO] - Starting iteration 528.
[2025-11-13 11:30:05,280][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 52 and human policies 1.
[2025-11-13 11:30:05,281][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:30:14,299][__main__][INFO] - Number of regex retries in iteration 528: 0
[2025-11-13 11:30:14,300][__main__][INFO] - agents played in iteration 528 are Bob_buffer, Alice_buffer, Alice, Bob
[2025-11-13 11:30:14,729][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:30:14,770][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:30:14,810][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:30:14,851][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:30:14,852][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:30:14,852][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:30:15,579][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:30:15,879][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:30:16,208][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:30:16,536][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:30:16,862][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:30:17,189][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:30:17,518][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:30:17,844][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:30:18,171][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:30:18,504][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:30:18,841][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:30:19,168][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:30:19,496][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:30:19,823][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:30:20,154][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:30:20,484][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:30:20,814][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:30:21,146][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:30:21,482][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:30:21,812][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:30:22,145][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:30:22,474][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:30:22,805][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:30:23,133][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:30:23,467][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:30:23,801][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:30:24,133][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:30:24,464][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:30:24,797][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:30:25,125][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:30:25,454][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:30:25,783][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:30:26,111][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:30:26,830][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:30:27,554][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:30:27,555][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:30:27,557][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:30:28,361][__main__][INFO] - Iteration 529 took 23s (39.07% Gen, 57.44% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 50m 27s. Estimated total time: 19h 14m 6s. Time estimates for 10 more iterations: 3m 50s, 100 more iterations: 38m 28s, 500 more iterations: 3h 12m 21s.
[2025-11-13 11:30:28,363][__main__][INFO] - Starting iteration 529.
[2025-11-13 11:30:28,366][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 52 and human policies 1.
[2025-11-13 11:30:28,366][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:30:37,629][__main__][INFO] - Number of regex retries in iteration 529: 0
[2025-11-13 11:30:37,630][__main__][INFO] - agents played in iteration 529 are Bob_buffer, Alice_buffer, Alice, Bob
[2025-11-13 11:30:38,067][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:30:38,107][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:30:38,148][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:30:38,189][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:30:38,190][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:30:38,190][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:30:38,938][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:30:39,238][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:30:39,566][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:30:39,896][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:30:40,223][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:30:40,551][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:30:40,887][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:30:41,216][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:30:41,545][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:30:41,872][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:30:42,207][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:30:42,532][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:30:42,864][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:30:43,192][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:30:43,528][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:30:43,863][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:30:44,194][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:30:44,521][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:30:44,849][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:30:45,186][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:30:45,517][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:30:45,849][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:30:46,182][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:30:46,509][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:30:46,838][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:30:47,167][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:30:47,494][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:30:47,822][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:30:48,152][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:30:48,484][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:30:48,812][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:30:49,141][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:30:49,472][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:30:50,193][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:30:50,896][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:30:50,897][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:30:50,899][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:30:51,752][__main__][INFO] - Iteration 530 took 23s (39.61% Gen, 56.74% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 5m 18s. Estimated total time: 19h 29m 20s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 58s, 500 more iterations: 3h 14m 53s.
[2025-11-13 11:30:51,754][__main__][INFO] - Starting iteration 530.
[2025-11-13 11:30:51,757][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 52 and human policies 1.
[2025-11-13 11:30:51,757][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:31:01,173][__main__][INFO] - Number of regex retries in iteration 530: 0
[2025-11-13 11:31:01,174][__main__][INFO] - agents played in iteration 530 are Bob_buffer, Alice_buffer, Alice, Bob
[2025-11-13 11:31:01,610][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:31:01,652][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:31:01,707][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:31:01,749][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:31:01,749][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:31:01,750][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:31:02,459][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:31:02,759][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:31:03,086][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:31:03,417][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:31:03,746][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:31:04,076][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:31:04,405][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:31:04,741][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:31:05,070][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:31:05,401][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:31:05,731][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:31:06,059][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:31:06,390][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:31:06,721][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:31:07,054][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:31:07,394][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:31:07,724][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:31:08,053][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:31:08,386][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:31:08,712][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:31:09,041][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:31:09,368][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:31:09,699][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:31:10,023][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:31:10,351][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:31:10,679][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:31:11,009][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:31:11,335][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:31:11,662][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:31:11,990][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:31:12,318][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:31:12,646][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:31:12,975][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:31:13,693][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:31:14,386][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:31:14,388][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:31:14,389][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:31:16,166][__main__][INFO] - Iteration 531 took 24s (38.58% Gen, 54.14% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 56m 4s. Estimated total time: 20h 20m 30s. Time estimates for 10 more iterations: 4m 4s, 100 more iterations: 40m 41s, 500 more iterations: 3h 23m 25s.
[2025-11-13 11:31:16,168][__main__][INFO] - Starting iteration 531.
[2025-11-13 11:31:16,171][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 53 and human policies 1.
[2025-11-13 11:31:16,172][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:31:25,777][__main__][INFO] - Number of regex retries in iteration 531: 0
[2025-11-13 11:31:25,777][__main__][INFO] - agents played in iteration 531 are Bob_buffer, Alice_buffer, Alice, Bob
[2025-11-13 11:31:26,213][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:31:26,253][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:31:26,294][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:31:26,334][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:31:26,335][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:31:26,335][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:31:27,032][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:31:27,332][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:31:27,665][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:31:28,002][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:31:28,326][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:31:28,655][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:31:28,984][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:31:29,313][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:31:29,641][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:31:29,971][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:31:30,305][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:31:30,633][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:31:30,963][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:31:31,293][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:31:31,628][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:31:31,957][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:31:32,290][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:31:32,621][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:31:32,951][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:31:33,279][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:31:33,607][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:31:33,939][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:31:34,269][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:31:34,599][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:31:34,929][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:31:35,258][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:31:35,587][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:31:35,916][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:31:36,246][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:31:36,573][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:31:36,906][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:31:37,234][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:31:37,567][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:31:38,268][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:31:38,982][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:31:38,983][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:31:38,985][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:31:39,791][__main__][INFO] - Iteration 532 took 23s (40.66% Gen, 55.91% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 16m 13s. Estimated total time: 19h 41m 3s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 22s, 500 more iterations: 3h 16m 50s. [2025-11-13 11:31:39,796][__main__][INFO] - Starting iteration 532. [2025-11-13 11:31:39,799][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 53 and human policies 1. 
[2025-11-13 11:31:39,800][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:31:48,951][__main__][INFO] - Number of regex retries in iteration 532: 0 [2025-11-13 11:31:48,951][__main__][INFO] - agents played in iteration 532 are Bob_buffer, Alice_buffer, Alice, Bob [2025-11-13 11:31:49,392][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:31:49,432][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:31:49,473][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:31:49,514][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:31:49,514][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:31:49,515][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:31:50,224][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:31:50,521][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:31:50,853][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:31:51,180][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:31:51,509][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:31:51,836][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:31:52,168][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:31:52,496][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:31:52,827][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:31:53,155][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:31:53,485][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:31:53,813][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:31:54,143][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:31:54,471][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:31:54,801][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:31:55,129][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:31:55,459][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:31:55,787][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:31:56,122][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:31:56,450][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:31:56,777][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 11:31:57,111][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:31:57,439][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:31:57,767][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:31:58,097][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:31:58,427][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:31:58,755][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:31:59,082][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:31:59,411][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:31:59,739][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:32:00,067][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:32:00,396][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:32:00,725][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 11:32:01,466][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:32:02,189][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:32:02,191][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:32:02,193][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:32:03,209][__main__][INFO] - Iteration 533 took 23s (39.09% Gen, 56.56% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 5m 19s. Estimated total time: 19h 30m 33s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 1s, 500 more iterations: 3h 15m 5s. [2025-11-13 11:32:03,211][__main__][INFO] - Starting iteration 533. [2025-11-13 11:32:03,215][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 53 and human policies 1. 
[2025-11-13 11:32:03,216][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:32:12,736][__main__][INFO] - Number of regex retries in iteration 533: 0 [2025-11-13 11:32:12,737][__main__][INFO] - agents played in iteration 533 are Bob_buffer, Alice_buffer, Alice, Bob [2025-11-13 11:32:13,175][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:32:13,216][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:32:13,256][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:32:13,296][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:32:13,297][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:32:13,297][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:32:14,015][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:32:14,313][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:32:14,643][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:32:14,972][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:32:15,301][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:32:15,630][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:32:15,958][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:32:16,287][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:32:16,615][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:32:16,947][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:32:17,277][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:32:17,605][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:32:17,933][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:32:18,263][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:32:18,593][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:32:18,923][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:32:19,251][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:32:19,581][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:32:19,910][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:32:20,238][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:32:20,568][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 11:32:20,897][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:32:21,231][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:32:21,558][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:32:21,888][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:32:22,217][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:32:22,547][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:32:22,875][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:32:23,202][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:32:23,532][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:32:23,873][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:32:24,201][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:32:24,532][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 11:32:25,259][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:32:25,993][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:32:25,995][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:32:25,996][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:32:26,931][__main__][INFO] - Iteration 534 took 23s (40.14% Gen, 55.91% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 20m 14s. Estimated total time: 19h 45m 51s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 31s, 500 more iterations: 3h 17m 38s. [2025-11-13 11:32:26,933][__main__][INFO] - Starting iteration 534. [2025-11-13 11:32:26,937][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 53 and human policies 1. 
[2025-11-13 11:32:26,937][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:32:36,634][__main__][INFO] - Number of regex retries in iteration 534: 0 [2025-11-13 11:32:36,635][__main__][INFO] - agents played in iteration 534 are Bob_buffer, Alice_buffer, Alice, Bob [2025-11-13 11:32:37,061][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:32:37,104][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:32:37,145][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:32:37,186][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:32:37,187][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:32:37,187][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:32:37,907][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:32:38,212][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:32:38,535][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:32:38,866][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:32:39,194][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:32:39,529][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:32:39,859][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:32:40,193][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:32:40,525][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:32:40,855][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:32:41,185][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:32:41,516][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:32:41,844][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:32:42,172][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:32:42,510][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:32:42,841][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:32:43,173][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:32:43,501][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:32:43,829][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:32:44,157][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:32:44,485][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 11:32:44,813][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:32:45,141][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:32:45,470][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:32:45,799][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:32:46,127][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:32:46,454][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:32:46,783][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:32:47,111][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:32:47,438][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:32:47,768][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:32:48,098][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:32:48,428][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 11:32:49,122][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:32:49,839][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:32:49,841][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:32:49,842][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:32:50,730][__main__][INFO] - Iteration 535 took 23s (40.76% Gen, 55.51% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 23m 40s. Estimated total time: 19h 49m 41s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 39s, 500 more iterations: 3h 18m 16s. [2025-11-13 11:32:50,732][__main__][INFO] - Starting iteration 535. [2025-11-13 11:32:50,735][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 53 and human policies 1. 
[2025-11-13 11:32:50,736][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:33:00,236][__main__][INFO] - Number of regex retries in iteration 535: 0 [2025-11-13 11:33:00,237][__main__][INFO] - agents played in iteration 535 are Bob_buffer, Alice_buffer, Alice, Bob [2025-11-13 11:33:00,670][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:33:00,710][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:33:00,750][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:33:00,789][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:33:00,790][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:33:00,790][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:33:01,515][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:33:01,815][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:33:02,150][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:33:02,482][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:33:02,813][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:33:03,142][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:33:03,473][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:33:03,800][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:33:04,128][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:33:04,458][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:33:04,793][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:33:05,122][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:33:05,451][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:33:05,780][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:33:06,107][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:33:06,436][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:33:06,766][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:33:07,097][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:33:07,425][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:33:07,756][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:33:08,084][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 11:33:08,413][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:33:08,743][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:33:09,075][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:33:09,403][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:33:09,737][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:33:10,063][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:33:10,391][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:33:10,722][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:33:11,057][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:33:11,384][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:33:11,719][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:33:12,049][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 11:33:12,744][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:33:13,477][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:33:13,479][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:33:13,481][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:33:14,354][__main__][INFO] - Iteration 536 took 23s (40.22% Gen, 56.07% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 14m 36s. Estimated total time: 19h 41m 1s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 22s, 500 more iterations: 3h 16m 50s. [2025-11-13 11:33:14,356][__main__][INFO] - Starting iteration 536. [2025-11-13 11:33:14,360][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 53 and human policies 1. 
[2025-11-13 11:33:14,361][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:33:24,622][__main__][INFO] - Number of regex retries in iteration 536: 0 [2025-11-13 11:33:24,623][__main__][INFO] - agents played in iteration 536 are Bob_buffer, Alice_buffer, Alice, Bob [2025-11-13 11:33:25,056][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:33:25,098][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:33:25,138][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:33:25,179][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:33:25,179][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:33:25,179][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:33:25,921][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:33:26,222][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:33:26,554][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:33:26,889][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:33:27,217][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:33:27,547][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:33:27,874][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:33:28,207][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:33:28,534][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:33:28,862][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:33:29,189][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:33:29,517][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:33:29,846][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:33:30,173][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:33:30,505][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:33:30,835][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:33:31,164][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:33:31,493][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:33:31,821][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:33:32,148][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:33:32,477][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 11:33:32,803][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:33:33,130][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:33:33,457][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:33:33,784][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:33:34,112][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:33:34,439][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:33:34,766][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:33:35,095][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:33:35,423][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:33:35,751][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:33:36,081][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:33:36,409][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 11:33:37,112][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:33:37,855][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:33:37,856][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:33:37,858][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:33:38,768][__main__][INFO] - Iteration 537 took 24s (42.04% Gen, 54.22% Train). Generation: 10s, Training: 13s. Estimated remaining time: 19h 53m 36s. Estimated total time: 20h 20m 25s. Time estimates for 10 more iterations: 4m 4s, 100 more iterations: 40m 40s, 500 more iterations: 3h 23m 24s. [2025-11-13 11:33:38,770][__main__][INFO] - Starting iteration 537. [2025-11-13 11:33:38,773][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 53 and human policies 1. 
[2025-11-13 11:33:38,774][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:33:48,683][__main__][INFO] - Number of regex retries in iteration 537: 0
[2025-11-13 11:33:48,684][__main__][INFO] - agents played in iteration 537 are Bob_buffer, Alice_buffer, Alice, Bob
[2025-11-13 11:33:49,123][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:33:49,163][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:33:49,203][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:33:49,244][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:33:49,245][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:33:49,246][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
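Each "For task: …" record reports three memory figures: the change in allocated VRAM over the block, the current allocation, and the block's peak, all as a share of total device memory. The trainer's actual implementation is not shown (on CUDA one would typically read `torch.cuda.memory_allocated()` and `torch.cuda.max_memory_allocated()`); a torch-free sketch of the arithmetic behind the line format, with all byte counts hypothetical, might be:

```python
def vram_report(task: str, before_bytes: int, after_bytes: int,
                peak_bytes: int, total_bytes: int) -> str:
    # Reconstruct the three percentages the trainer logs for a task block:
    # ΔVRAM over the block, current usage, and the block peak, each
    # relative to total device memory.
    delta_pct = 100.0 * (after_bytes - before_bytes) / total_bytes
    current_pct = 100.0 * after_bytes / total_bytes
    peak_pct = 100.0 * peak_bytes / total_bytes
    return (f"For task: {task}, ΔVRAM % (total): {delta_pct:.2f}%, "
            f"Current % of VRAM taken: {current_pct:.2f}%, "
            f"Block Peak % of device VRAM: {peak_pct:.2f}%")
```

With no net allocation over the block (`before_bytes == after_bytes`), the ΔVRAM term is 0.00% while current and peak usage stay nonzero, matching the advantage-computation lines above.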
[2025-11-13 11:33:49,986][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:33:50,286][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:33:50,625][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:33:50,953][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:33:51,281][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:33:51,609][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:33:51,936][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:33:52,263][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:33:52,592][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:33:52,920][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:33:53,246][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:33:53,576][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:33:53,903][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:33:54,233][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:33:54,566][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:33:54,893][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:33:55,225][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:33:55,555][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:33:55,889][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:33:56,217][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:33:56,545][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:33:56,872][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:33:57,211][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:33:57,544][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:33:57,875][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:33:58,203][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:33:58,533][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:33:58,867][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:33:59,200][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:33:59,533][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:33:59,856][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:34:00,185][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:34:00,512][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
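The run above walks 128 mini-batches, logs every 4th one, and only then reports a single accumulated policy-gradient loss over 3840 tokens (i.e. ~30 tokens per mini-batch) before one optimizer step. A toy, framework-free sketch of that accumulation cadence (the function name and the `(loss, n_tokens)` tuples are illustrative; the real trainer calls `backward()` per mini-batch rather than summing floats):

```python
def accumulate_policy_loss(minibatches, log_every=4):
    # Walk the mini-batches, emit a progress line every `log_every`-th one,
    # and accumulate a token-weighted loss so a single optimizer step
    # effectively sees the whole batch.
    total_loss, total_tokens, lines = 0.0, 0, []
    for i, (loss, n_tokens) in enumerate(minibatches):
        if i % log_every == 0:
            lines.append(f"Processing mini-batch {i} of {len(minibatches)}")
        total_loss += loss * n_tokens   # real trainer: scale loss, then backward()
        total_tokens += n_tokens
    lines.append(f"Accumulated the policy gradient loss for {total_tokens} tokens.")
    return total_loss / total_tokens, lines
```

Accumulating before the step keeps peak VRAM bounded by one mini-batch's activations while still using the full batch's gradient, which is consistent with the modest "Block Peak" figures logged during accumulation.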
[2025-11-13 11:34:01,204][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:34:01,927][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:34:01,929][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:34:01,930][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:34:02,865][__main__][INFO] - Iteration 538 took 24s (41.13% Gen, 54.98% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 37m 25s. Estimated total time: 20h 4m 38s. Time estimates for 10 more iterations: 4m 0s, 100 more iterations: 40m 9s, 500 more iterations: 3h 20m 46s.
[2025-11-13 11:34:02,868][__main__][INFO] - Starting iteration 538.
[2025-11-13 11:34:02,871][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 53 and human policies 1.
[2025-11-13 11:34:02,872][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:34:12,142][__main__][INFO] - Number of regex retries in iteration 538: 0
[2025-11-13 11:34:12,143][__main__][INFO] - agents played in iteration 538 are Bob_buffer, Alice_buffer, Alice, Bob
[2025-11-13 11:34:12,582][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:34:12,636][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:34:12,678][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:34:12,719][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:34:12,719][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:34:12,720][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
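The "Sharing advantage alignment data" / "Receiving advantage packets" pair suggests that each agent's trainer exports its per-timestep advantages so the opponent can combine them with its own when forming the advantage-alignment objective. The actual packet schema and combination rule in `trainer_ad_align` are not visible in the log; a deliberately minimal, hypothetical sketch of such an exchange (names and the product rule here are illustrative only):

```python
from dataclasses import dataclass


@dataclass
class AdvantagePacket:
    # Hypothetical exchange format: which agent produced it, plus one
    # advantage estimate per timestep of the shared episode.
    agent: str
    advantages: list


def aligned_weights(own: AdvantagePacket, other: AdvantagePacket) -> list:
    # Illustrative combination: weight each timestep by the product of the
    # two agents' advantages, so steps that were good (or bad) for *both*
    # agents dominate the update. The real alignment term may differ, e.g.
    # by discounting or summing over past timesteps.
    assert len(own.advantages) == len(other.advantages)
    return [a * b for a, b in zip(own.advantages, other.advantages)]
```

Under this toy rule a timestep where both advantages are positive (mutual cooperation paying off) gets a large positive weight, while a step that helped one agent at the other's expense gets a negative one.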
[2025-11-13 11:34:13,446][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:34:13,747][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:34:14,078][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:34:14,408][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:34:14,737][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:34:15,067][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:34:15,396][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:34:15,728][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:34:16,056][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:34:16,385][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:34:16,714][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:34:17,044][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:34:17,375][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:34:17,703][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:34:18,030][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:34:18,358][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:34:18,686][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:34:19,014][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:34:19,341][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:34:19,668][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:34:19,997][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:34:20,326][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:34:20,653][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:34:20,982][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:34:21,309][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:34:21,637][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:34:21,964][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:34:22,294][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:34:22,623][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:34:22,951][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:34:23,280][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:34:23,609][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:34:23,937][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:34:24,604][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:34:25,316][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:34:25,318][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:34:25,320][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:34:26,272][__main__][INFO] - Iteration 539 took 23s (39.62% Gen, 56.31% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 2m 29s. Estimated total time: 19h 30m 6s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 0s, 500 more iterations: 3h 15m 1s.
[2025-11-13 11:34:26,274][__main__][INFO] - Starting iteration 539.
[2025-11-13 11:34:26,278][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 53 and human policies 1.
[2025-11-13 11:34:26,278][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:34:36,558][__main__][INFO] - Number of regex retries in iteration 539: 0
[2025-11-13 11:34:36,559][__main__][INFO] - agents played in iteration 539 are Bob_buffer, Alice_buffer, Alice, Bob
[2025-11-13 11:34:37,008][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:34:37,048][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:34:37,088][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:34:37,129][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:34:37,129][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:34:37,129][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:34:37,882][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:34:38,183][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:34:38,513][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:34:38,844][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:34:39,173][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:34:39,501][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:34:39,829][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:34:40,157][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:34:40,484][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:34:40,819][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:34:41,146][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:34:41,474][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:34:41,801][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:34:42,128][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:34:42,457][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:34:42,787][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:34:43,127][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:34:43,454][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:34:43,785][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:34:44,120][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:34:44,454][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:34:44,785][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:34:45,115][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:34:45,445][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:34:45,770][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:34:46,097][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:34:46,427][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:34:46,753][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:34:47,079][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:34:47,409][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:34:47,739][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:34:48,067][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:34:48,394][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:34:49,056][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:34:49,785][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:34:49,787][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:34:49,789][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:34:50,689][__main__][INFO] - Iteration 540 took 24s (42.11% Gen, 54.19% Train). Generation: 10s, Training: 13s. Estimated remaining time: 19h 52m 36s. Estimated total time: 20h 20m 37s. Time estimates for 10 more iterations: 4m 4s, 100 more iterations: 40m 41s, 500 more iterations: 3h 23m 26s.
[2025-11-13 11:34:50,691][__main__][INFO] - Starting iteration 540.
[2025-11-13 11:34:50,695][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 53 and human policies 1.
[2025-11-13 11:34:50,695][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:35:00,067][__main__][INFO] - Number of regex retries in iteration 540: 0
[2025-11-13 11:35:00,067][__main__][INFO] - agents played in iteration 540 are Bob_buffer, Alice_buffer, Alice, Bob
[2025-11-13 11:35:00,499][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:35:00,539][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:35:00,578][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:35:00,618][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:35:00,619][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:35:00,619][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:35:01,369][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:35:01,670][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:35:02,000][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:35:02,329][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:35:02,659][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:35:02,992][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:35:03,324][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:35:03,652][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:35:03,979][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:35:04,306][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:35:04,633][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:35:04,962][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:35:05,293][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:35:05,625][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:35:05,952][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:35:06,279][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:35:06,607][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:35:06,938][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:35:07,266][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:35:07,592][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:35:07,922][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:35:08,250][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:35:08,582][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:35:08,913][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:35:09,249][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:35:09,580][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:35:09,911][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:35:10,244][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:35:10,571][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:35:10,899][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:35:11,225][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:35:11,554][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:35:11,882][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:35:12,548][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:35:13,288][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:35:13,289][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:35:13,291][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:35:15,108][__main__][INFO] - Iteration 541 took 24s (38.39% Gen, 54.16% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 52m 18s. Estimated total time: 20h 20m 44s. Time estimates for 10 more iterations: 4m 4s, 100 more iterations: 40m 41s, 500 more iterations: 3h 23m 27s.
[2025-11-13 11:35:15,110][__main__][INFO] - Starting iteration 541.
[2025-11-13 11:35:15,114][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 54 and human policies 1.
[2025-11-13 11:35:15,114][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:35:25,205][__main__][INFO] - Number of regex retries in iteration 541: 0
[2025-11-13 11:35:25,206][__main__][INFO] - agents played in iteration 541 are Bob_buffer, Alice_buffer, Alice, Bob
[2025-11-13 11:35:25,626][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:35:25,667][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:35:25,707][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:35:25,746][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:35:25,747][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:35:25,747][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:35:26,487][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:35:26,786][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:35:27,115][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:35:27,443][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:35:27,773][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:35:28,100][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:35:28,428][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:35:28,757][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:35:29,085][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:35:29,415][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:35:29,744][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:35:30,072][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:35:30,398][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:35:30,729][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:35:31,058][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:35:31,386][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:35:31,721][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:35:32,051][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:35:32,379][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:35:32,708][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:35:33,040][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:35:33,366][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:35:33,694][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:35:34,025][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:35:34,355][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:35:34,679][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:35:35,007][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:35:35,335][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:35:35,662][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:35:35,992][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:35:36,321][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:35:36,650][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:35:36,977][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:35:37,643][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:35:38,379][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:35:38,381][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:35:38,383][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:35:39,310][__main__][INFO] - Iteration 542 took 24s (41.71% Gen, 54.46% Train). Generation: 10s, Training: 13s. Estimated remaining time: 19h 41m 0s. Estimated total time: 20h 9m 50s. Time estimates for 10 more iterations: 4m 1s, 100 more iterations: 40m 19s, 500 more iterations: 3h 21m 38s.
[2025-11-13 11:35:39,312][__main__][INFO] - Starting iteration 542.
[2025-11-13 11:35:39,315][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 54 and human policies 1.
[2025-11-13 11:35:39,315][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:35:50,418][__main__][INFO] - Number of regex retries in iteration 542: 0
[2025-11-13 11:35:50,419][__main__][INFO] - agents played in iteration 542 are Bob_buffer, Alice_buffer, Alice, Bob
[2025-11-13 11:35:50,856][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:35:50,897][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:35:50,937][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:35:50,978][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:35:50,978][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:35:50,979][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:35:51,731][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:35:52,030][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:35:52,359][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:35:52,689][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:35:53,018][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:35:53,346][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:35:53,677][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:35:54,004][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:35:54,343][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:35:54,674][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:35:55,008][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:35:55,343][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:35:55,674][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:35:56,000][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:35:56,327][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:35:56,655][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:35:56,982][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:35:57,311][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:35:57,638][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:35:57,967][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:35:58,296][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:35:58,623][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:35:58,950][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:35:59,279][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:35:59,608][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:35:59,936][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:36:00,264][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:36:00,597][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:36:00,926][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:36:01,260][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:36:01,591][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:36:01,919][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:36:02,246][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:36:02,944][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:36:03,687][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:36:03,688][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:36:03,690][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:36:04,598][__main__][INFO] - Iteration 543 took 25s (43.91% Gen, 52.49% Train). Generation: 11s, Training: 13s. Estimated remaining time: 20h 34m 56s. Estimated total time: 21h 4m 11s. Time estimates for 10 more iterations: 4m 12s, 100 more iterations: 42m 8s, 500 more iterations: 3h 30m 41s.
[2025-11-13 11:36:04,600][__main__][INFO] - Starting iteration 543.
[2025-11-13 11:36:04,604][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 54 and human policies 1.
[2025-11-13 11:36:04,604][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:36:15,092][__main__][INFO] - Number of regex retries in iteration 543: 0
[2025-11-13 11:36:15,093][__main__][INFO] - agents played in iteration 543 are Bob_buffer, Alice_buffer, Alice, Bob
[2025-11-13 11:36:15,514][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:36:15,558][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:36:15,598][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:36:15,639][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:36:15,639][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:36:15,639][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:36:16,375][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:36:16,678][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:36:17,008][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:36:17,342][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:36:17,677][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:36:18,009][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:36:18,338][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:36:18,665][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:36:18,993][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:36:19,320][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:36:19,646][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:36:19,973][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:36:20,302][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:36:20,628][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:36:20,955][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:36:21,284][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:36:21,614][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:36:21,946][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:36:22,273][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:36:22,601][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:36:22,928][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:36:23,264][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:36:23,592][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:36:23,926][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:36:24,255][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:36:24,580][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:36:24,909][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:36:25,236][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:36:25,570][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:36:25,894][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:36:26,219][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:36:26,548][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:36:26,879][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:36:27,543][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:36:28,289][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:36:28,290][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:36:28,292][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:36:29,190][__main__][INFO] - Iteration 544 took 24s (42.66% Gen, 53.68% Train). Generation: 10s, Training: 13s. Estimated remaining time: 19h 59m 44s. Estimated total time: 20h 29m 24s. Time estimates for 10 more iterations: 4m 5s, 100 more iterations: 40m 58s, 500 more iterations: 3h 24m 54s.
[2025-11-13 11:36:29,193][__main__][INFO] - Starting iteration 544.
[2025-11-13 11:36:29,197][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 54 and human policies 1.
[2025-11-13 11:36:29,197][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:36:39,668][__main__][INFO] - Number of regex retries in iteration 544: 0
[2025-11-13 11:36:39,669][__main__][INFO] - agents played in iteration 544 are Bob_buffer, Alice_buffer, Alice, Bob
[2025-11-13 11:36:40,099][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:36:40,140][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:36:40,195][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:36:40,234][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:36:40,235][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:36:40,235][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:36:40,958][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:36:41,259][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:36:41,591][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:36:41,920][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:36:42,249][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:36:42,580][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:36:42,907][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:36:43,234][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:36:43,563][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:36:43,890][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:36:44,219][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:36:44,549][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:36:44,876][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:36:45,203][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:36:45,537][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:36:45,864][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:36:46,190][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:36:46,520][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:36:46,862][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:36:47,191][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:36:47,519][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:36:47,846][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:36:48,178][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:36:48,504][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:36:48,831][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:36:49,159][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:36:49,488][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:36:49,815][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:36:50,143][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:36:50,474][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:36:50,797][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:36:51,125][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:36:51,451][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:36:52,130][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:36:52,850][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:36:52,851][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:36:52,853][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:36:53,741][__main__][INFO] - Iteration 545 took 24s (42.66% Gen, 53.71% Train). Generation: 10s, Training: 13s. Estimated remaining time: 19h 57m 11s. Estimated total time: 20h 27m 15s. Time estimates for 10 more iterations: 4m 5s, 100 more iterations: 40m 54s, 500 more iterations: 3h 24m 32s.
[2025-11-13 11:36:53,743][__main__][INFO] - Starting iteration 545.
[2025-11-13 11:36:53,746][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 54 and human policies 1.
[2025-11-13 11:36:53,747][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:37:04,211][__main__][INFO] - Number of regex retries in iteration 545: 0
[2025-11-13 11:37:04,211][__main__][INFO] - agents played in iteration 545 are Bob_buffer, Alice_buffer, Alice, Bob
[2025-11-13 11:37:04,636][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:37:04,677][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:37:04,718][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:37:04,758][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:37:04,759][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:37:04,759][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:37:05,500][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:37:05,799][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:37:06,129][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:37:06,463][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:37:06,793][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:37:07,122][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:37:07,451][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:37:07,781][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:37:08,106][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:37:08,433][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:37:08,760][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:37:09,087][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:37:09,413][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:37:09,739][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:37:10,068][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:37:10,395][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:37:10,728][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:37:11,054][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:37:11,381][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:37:11,708][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:37:12,043][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:37:12,371][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:37:12,698][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:37:13,028][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:37:13,365][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:37:13,692][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:37:14,019][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:37:14,347][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:37:14,679][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:37:15,007][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:37:15,336][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:37:15,667][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:37:15,989][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:37:16,664][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:37:17,404][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:37:17,405][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:37:17,407][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:37:18,335][__main__][INFO] - Iteration 546 took 24s (42.55% Gen, 53.66% Train). Generation: 10s, Training: 13s. Estimated remaining time: 19h 59m 2s. Estimated total time: 20h 29m 30s. Time estimates for 10 more iterations: 4m 5s, 100 more iterations: 40m 59s, 500 more iterations: 3h 24m 55s.
[2025-11-13 11:37:18,338][__main__][INFO] - Starting iteration 546.
[2025-11-13 11:37:18,341][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 54 and human policies 1.
[2025-11-13 11:37:18,341][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:37:28,445][__main__][INFO] - Number of regex retries in iteration 546: 0
[2025-11-13 11:37:28,446][__main__][INFO] - agents played in iteration 546 are Bob_buffer, Alice_buffer, Alice, Bob
[2025-11-13 11:37:28,893][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:37:28,934][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:37:28,975][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:37:29,015][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:37:29,016][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:37:29,016][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:37:29,745][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:37:30,050][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:37:30,380][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:37:30,708][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:37:31,036][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:37:31,366][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:37:31,696][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:37:32,025][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:37:32,352][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:37:32,680][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:37:33,008][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:37:33,337][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:37:33,667][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:37:33,997][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:37:34,325][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:37:34,655][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:37:34,983][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:37:35,309][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:37:35,635][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:37:35,962][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:37:36,289][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:37:36,619][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:37:36,949][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:37:37,283][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:37:37,614][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:37:37,952][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:37:38,283][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:37:38,613][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:37:38,940][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:37:39,275][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:37:39,602][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:37:39,928][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:37:40,261][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:37:40,952][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:37:41,688][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:37:41,690][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:37:41,691][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:37:42,561][__main__][INFO] - Iteration 547 took 24s (41.72% Gen, 54.69% Train). Generation: 10s, Training: 13s. Estimated remaining time: 19h 40m 12s. Estimated total time: 20h 11m 5s. Time estimates for 10 more iterations: 4m 2s, 100 more iterations: 40m 22s, 500 more iterations: 3h 21m 50s.
[2025-11-13 11:37:42,563][__main__][INFO] - Starting iteration 547.
[2025-11-13 11:37:42,567][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 54 and human policies 1.
[2025-11-13 11:37:42,567][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:37:52,469][__main__][INFO] - Number of regex retries in iteration 547: 0
[2025-11-13 11:37:52,470][__main__][INFO] - agents played in iteration 547 are Bob_buffer, Alice_buffer, Alice, Bob
[2025-11-13 11:37:52,920][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:37:52,967][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:37:53,008][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:37:53,048][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:37:53,049][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:37:53,049][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:37:53,789][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:37:54,092][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:37:54,421][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:37:54,749][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:37:55,080][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:37:55,408][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:37:55,736][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:37:56,067][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:37:56,399][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:37:56,726][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:37:57,053][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:37:57,384][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:37:57,712][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:37:58,040][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:37:58,366][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:37:58,693][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:37:59,022][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:37:59,348][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:37:59,676][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:38:00,003][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:38:00,330][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:38:00,658][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:38:00,985][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:38:01,312][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:38:01,645][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:38:01,972][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:38:02,298][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:38:02,625][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:38:02,964][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:38:03,291][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:38:03,620][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:38:03,947][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:38:04,278][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:38:04,952][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:38:05,694][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:38:05,696][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:38:05,697][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:38:06,579][__main__][INFO] - Iteration 548 took 24s (41.24% Gen, 55.08% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 29m 24s. Estimated total time: 20h 0m 40s. Time estimates for 10 more iterations: 4m 0s, 100 more iterations: 40m 1s, 500 more iterations: 3h 20m 6s.
[2025-11-13 11:38:06,582][__main__][INFO] - Starting iteration 548.
[2025-11-13 11:38:06,586][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 54 and human policies 1.
[2025-11-13 11:38:06,586][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:38:16,770][__main__][INFO] - Number of regex retries in iteration 548: 0
[2025-11-13 11:38:16,770][__main__][INFO] - agents played in iteration 548 are Bob_buffer, Alice_buffer, Alice, Bob
[2025-11-13 11:38:17,205][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:38:17,246][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:38:17,286][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:38:17,325][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:38:17,326][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:38:17,327][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:38:18,034][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:38:18,333][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:38:18,664][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:38:18,994][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:38:19,330][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:38:19,660][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:38:19,990][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:38:20,323][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:38:20,657][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:38:20,990][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:38:21,320][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:38:21,649][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:38:21,974][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:38:22,301][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:38:22,628][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:38:22,956][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:38:23,289][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:38:23,619][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:38:23,949][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:38:24,276][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:38:24,604][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:38:24,932][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:38:25,261][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:38:25,589][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:38:25,917][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:38:26,246][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:38:26,572][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:38:26,898][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:38:27,229][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:38:27,555][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:38:27,881][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:38:28,208][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:38:28,535][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:38:29,199][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:38:29,920][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:38:29,922][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:38:29,923][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:38:30,820][__main__][INFO] - Iteration 549 took 24s (42.02% Gen, 54.27% Train). Generation: 10s, Training: 13s. Estimated remaining time: 19h 40m 6s. Estimated total time: 20h 11m 48s. Time estimates for 10 more iterations: 4m 2s, 100 more iterations: 40m 23s, 500 more iterations: 3h 21m 58s.
[2025-11-13 11:38:30,822][__main__][INFO] - Starting iteration 549.
[2025-11-13 11:38:30,827][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 54 and human policies 1.
[2025-11-13 11:38:30,827][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:38:41,370][__main__][INFO] - Number of regex retries in iteration 549: 0
[2025-11-13 11:38:41,371][__main__][INFO] - agents played in iteration 549 are Bob_buffer, Alice_buffer, Alice, Bob
[2025-11-13 11:38:41,813][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:38:41,859][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:38:41,915][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:38:41,955][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:38:41,956][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:38:41,956][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:38:42,677][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:38:42,976][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:38:43,305][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:38:43,636][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:38:43,966][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:38:44,297][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:38:44,632][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:38:44,960][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:38:45,294][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:38:45,628][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:38:45,964][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:38:46,291][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:38:46,620][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:38:46,948][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:38:47,283][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:38:47,613][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:38:47,940][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:38:48,273][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:38:48,604][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:38:48,932][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:38:49,261][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:38:49,588][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:38:49,914][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:38:50,243][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:38:50,573][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:38:50,907][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:38:51,236][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:38:51,566][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:38:51,899][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:38:52,225][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:38:52,560][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:38:52,888][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:38:53,215][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:38:53,899][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:38:54,620][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:38:54,622][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:38:54,623][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:38:55,566][__main__][INFO] - Iteration 550 took 24s (42.62% Gen, 53.57% Train). Generation: 10s, Training: 13s. Estimated remaining time: 20h 4m 56s. Estimated total time: 20h 37m 2s. Time estimates for 10 more iterations: 4m 7s, 100 more iterations: 41m 14s, 500 more iterations: 3h 26m 10s.
[2025-11-13 11:38:55,568][__main__][INFO] - Starting iteration 550.
[2025-11-13 11:38:55,572][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 54 and human policies 1.
[2025-11-13 11:38:55,572][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:39:05,483][__main__][INFO] - Number of regex retries in iteration 550: 0
[2025-11-13 11:39:05,484][__main__][INFO] - agents played in iteration 550 are Bob_buffer, Alice_buffer, Alice, Bob
[2025-11-13 11:39:05,911][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:39:05,953][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:39:05,993][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:39:06,033][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:39:06,034][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:39:06,034][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:39:06,737][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:39:07,037][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:39:07,368][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:39:07,697][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:39:08,025][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:39:08,363][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:39:08,692][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:39:09,024][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:39:09,351][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:39:09,686][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:39:10,015][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:39:10,342][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:39:10,669][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:39:10,998][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:39:11,326][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:39:11,654][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:39:11,986][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:39:12,307][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:39:12,634][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:39:12,961][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:39:13,291][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:39:13,616][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:39:13,950][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:39:14,278][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:39:14,608][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:39:14,934][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:39:15,263][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:39:15,591][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:39:15,919][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:39:16,256][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:39:16,591][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:39:16,924][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:39:17,253][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:39:17,922][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:39:18,651][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:39:18,652][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:39:18,654][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:39:20,394][__main__][INFO] - Iteration 551 took 24s (39.93% Gen, 53.06% Train). Generation: 9s, Training: 13s. Estimated remaining time: 20h 8m 39s. Estimated total time: 20h 41m 10s. Time estimates for 10 more iterations: 4m 8s, 100 more iterations: 41m 22s, 500 more iterations: 3h 26m 51s.
[2025-11-13 11:39:20,396][__main__][INFO] - Starting iteration 551.
[2025-11-13 11:39:20,400][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 55 and human policies 1.
[2025-11-13 11:39:20,401][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:39:31,391][__main__][INFO] - Number of regex retries in iteration 551: 0
[2025-11-13 11:39:31,392][__main__][INFO] - agents played in iteration 551 are Bob_buffer, Alice_buffer, Alice, Bob
[2025-11-13 11:39:31,837][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:39:31,890][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:39:31,934][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:39:31,974][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:39:31,975][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:39:31,975][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:39:32,711][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:39:33,012][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:39:33,355][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:39:33,685][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:39:34,013][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:39:34,341][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:39:34,683][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:39:35,012][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:39:35,340][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:39:35,669][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:39:35,998][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:39:36,325][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:39:36,655][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:39:36,989][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:39:37,312][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:39:37,641][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:39:37,970][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:39:38,296][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:39:38,625][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:39:38,953][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:39:39,281][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:39:39,941][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:39:40,267][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:39:40,594][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:39:40,921][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:39:41,258][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:39:41,584][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:39:41,912][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:39:42,246][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:39:42,578][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:39:42,904][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:39:43,230][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:39:43,230][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:39:43,897][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:39:44,616][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:39:44,618][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:39:44,620][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:39:45,603][__main__][INFO] - Iteration 552 took 25s (43.61% Gen, 52.49% Train). Generation: 10s, Training: 13s. Estimated remaining time: 20h 27m 14s. Estimated total time: 21h 0m 10s. Time estimates for 10 more iterations: 4m 12s, 100 more iterations: 42m 0s, 500 more iterations: 3h 30m 1s.
[2025-11-13 11:39:45,605][__main__][INFO] - Starting iteration 552.
[2025-11-13 11:39:45,608][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 55 and human policies 1.
[2025-11-13 11:39:45,609][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:39:55,682][__main__][INFO] - Number of regex retries in iteration 552: 0
[2025-11-13 11:39:55,683][__main__][INFO] - agents played in iteration 552 are Bob_buffer, Alice_buffer, Alice, Bob
[2025-11-13 11:39:56,101][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:39:56,142][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:39:56,182][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:39:56,221][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:39:56,222][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:39:56,222][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:39:56,970][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:39:57,271][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:39:57,606][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:39:57,936][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:39:58,268][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:39:58,599][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:39:58,931][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:39:59,262][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:39:59,597][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:39:59,930][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:40:00,261][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:40:00,591][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:40:00,918][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:40:01,245][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:40:01,576][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:40:01,907][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:40:02,237][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:40:02,566][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:40:02,893][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:40:03,222][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:40:03,553][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:40:03,885][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:40:04,220][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:40:04,551][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:40:04,885][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:40:05,213][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:40:05,549][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:40:05,887][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:40:06,220][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:40:06,549][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:40:06,876][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:40:07,206][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:40:07,535][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:40:08,214][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:40:08,973][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:40:08,975][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:40:08,977][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:40:09,863][__main__][INFO] - Iteration 553 took 24s (41.53% Gen, 54.81% Train). Generation: 10s, Training: 13s. Estimated remaining time: 19h 39m 29s. Estimated total time: 20h 12m 49s. Time estimates for 10 more iterations: 4m 2s, 100 more iterations: 40m 25s, 500 more iterations: 3h 22m 8s.
[2025-11-13 11:40:09,866][__main__][INFO] - Starting iteration 553.
[2025-11-13 11:40:09,869][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 55 and human policies 1.
[2025-11-13 11:40:09,869][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:40:20,429][__main__][INFO] - Number of regex retries in iteration 553: 0
[2025-11-13 11:40:20,429][__main__][INFO] - agents played in iteration 553 are Bob_buffer, Alice_buffer, Alice, Bob
[2025-11-13 11:40:20,854][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:40:20,894][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:40:20,933][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:40:20,973][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:40:20,974][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:40:20,974][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:40:21,714][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:40:22,016][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:40:22,347][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:40:22,677][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:40:23,011][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:40:23,339][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:40:23,669][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:40:23,999][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:40:24,328][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:40:24,654][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:40:24,980][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:40:25,310][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:40:25,640][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:40:25,971][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:40:26,299][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:40:26,629][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:40:26,965][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:40:27,295][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:40:27,630][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:40:27,959][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:40:28,289][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:40:28,617][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:40:28,951][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:40:29,284][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:40:29,616][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:40:29,941][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:40:30,272][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:40:30,600][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:40:30,934][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:40:31,271][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:40:31,603][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:40:31,933][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:40:32,262][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:40:32,942][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:40:33,679][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:40:33,681][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:40:33,682][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:40:34,568][__main__][INFO] - Iteration 554 took 24s (42.75% Gen, 53.66% Train). Generation: 10s, Training: 13s. Estimated remaining time: 20h 1m 15s. Estimated total time: 20h 35m 0s. Time estimates for 10 more iterations: 4m 7s, 100 more iterations: 41m 10s, 500 more iterations: 3h 25m 50s.
[2025-11-13 11:40:34,570][__main__][INFO] - Starting iteration 554.
[2025-11-13 11:40:34,574][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 55 and human policies 1.
[2025-11-13 11:40:34,575][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:40:44,436][__main__][INFO] - Number of regex retries in iteration 554: 0
[2025-11-13 11:40:44,437][__main__][INFO] - agents played in iteration 554 are Bob_buffer, Alice_buffer, Alice, Bob
[2025-11-13 11:40:44,861][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:40:44,903][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:40:44,944][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:40:44,984][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:40:44,985][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:40:44,985][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:40:45,716][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:40:46,014][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:40:46,342][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:40:46,669][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:40:46,996][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:40:47,323][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:40:47,651][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:40:47,982][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:40:48,313][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:40:48,647][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:40:48,975][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:40:49,304][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:40:49,632][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:40:49,959][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:40:50,286][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:40:50,615][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:40:50,941][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:40:51,269][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:40:51,597][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:40:51,930][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:40:52,257][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:40:52,588][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:40:52,925][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:40:53,250][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:40:53,577][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:40:53,907][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:40:54,234][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:40:54,564][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:40:54,893][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:40:55,224][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:40:55,557][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:40:55,900][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:40:56,230][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:40:56,894][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:40:57,616][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:40:57,617][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:40:57,619][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:40:58,557][__main__][INFO] - Iteration 555 took 23s (41.12% Gen, 54.96% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 25m 2s. Estimated total time: 19h 59m 11s. Time estimates for 10 more iterations: 3m 59s, 100 more iterations: 39m 58s, 500 more iterations: 3h 19m 51s.
[2025-11-13 11:40:58,559][__main__][INFO] - Starting iteration 555.
[2025-11-13 11:40:58,562][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 55 and human policies 1.
[2025-11-13 11:40:58,563][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:41:08,620][__main__][INFO] - Number of regex retries in iteration 555: 0
[2025-11-13 11:41:08,621][__main__][INFO] - agents played in iteration 555 are Bob_buffer, Alice_buffer, Alice, Bob
[2025-11-13 11:41:09,064][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:41:09,105][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:41:09,145][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:41:09,185][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:41:09,185][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:41:09,186][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:41:09,901][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:41:10,207][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:41:10,531][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:41:10,862][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:41:11,192][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:41:11,519][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:41:11,849][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:41:12,176][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:41:12,507][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:41:12,843][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:41:13,177][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:41:13,508][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:41:13,841][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:41:14,176][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:41:14,507][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:41:14,839][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:41:15,172][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:41:15,511][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:41:15,842][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:41:16,177][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:41:16,507][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:41:16,838][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:41:17,167][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:41:17,494][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:41:17,822][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:41:18,149][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:41:18,476][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:41:18,804][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:41:19,133][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:41:19,459][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:41:19,799][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:41:20,128][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:41:20,455][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:41:21,115][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:41:21,836][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:41:21,837][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:41:21,839][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:41:22,727][__main__][INFO] - Iteration 556 took 24s (41.62% Gen, 54.70% Train). Generation: 10s, Training: 13s. Estimated remaining time: 19h 33m 42s. Estimated total time: 20h 8m 15s. Time estimates for 10 more iterations: 4m 1s, 100 more iterations: 40m 16s, 500 more iterations: 3h 21m 22s.
[2025-11-13 11:41:22,729][__main__][INFO] - Starting iteration 556.
[2025-11-13 11:41:22,732][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 55 and human policies 1.
[2025-11-13 11:41:22,733][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:41:32,538][__main__][INFO] - Number of regex retries in iteration 556: 0
[2025-11-13 11:41:32,539][__main__][INFO] - agents played in iteration 556 are Bob_buffer, Alice_buffer, Alice, Bob
[2025-11-13 11:41:32,978][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:41:33,019][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:41:33,058][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:41:33,098][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:41:33,099][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:41:33,099][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:41:33,803][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:41:34,101][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:41:34,430][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:41:34,758][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:41:35,086][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:41:35,415][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:41:35,744][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:41:36,079][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:41:36,409][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:41:36,740][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:41:37,076][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:41:37,407][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:41:37,735][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:41:38,062][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:41:38,392][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:41:38,721][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:41:39,050][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:41:39,377][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:41:39,703][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:41:40,030][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:41:40,358][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:41:40,684][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:41:41,013][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:41:41,339][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:41:41,667][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:41:41,994][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:41:42,321][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:41:42,647][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:41:42,975][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:41:43,302][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:41:43,629][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:41:43,956][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:41:44,283][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:41:44,954][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:41:45,685][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:41:45,686][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:41:45,688][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:41:46,606][__main__][INFO] - Iteration 557 took 23s (41.07% Gen, 55.08% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 18m 48s. Estimated total time: 19h 53m 45s. Time estimates for 10 more iterations: 3m 58s, 100 more iterations: 39m 47s, 500 more iterations: 3h 18m 57s.
[2025-11-13 11:41:46,608][__main__][INFO] - Starting iteration 557.
[2025-11-13 11:41:46,612][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 55 and human policies 1.
[2025-11-13 11:41:46,612][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:41:56,549][__main__][INFO] - Number of regex retries in iteration 557: 0
[2025-11-13 11:41:56,550][__main__][INFO] - agents played in iteration 557 are Bob_buffer, Alice_buffer, Alice, Bob
[2025-11-13 11:41:56,978][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:41:57,017][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:41:57,057][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:41:57,097][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:41:57,097][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:41:57,098][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:41:57,803][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:41:58,105][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:41:58,433][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:41:58,761][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:41:59,089][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:41:59,418][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:41:59,749][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:42:00,080][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:42:00,411][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:42:00,741][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:42:01,069][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:42:01,400][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:42:01,729][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:42:02,056][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:42:02,385][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:42:02,716][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:42:03,045][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:42:03,373][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:42:03,703][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:42:04,032][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:42:04,363][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:42:04,696][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:42:05,024][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:42:05,355][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:42:05,684][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:42:06,015][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:42:06,349][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:42:06,676][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:42:07,005][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:42:07,336][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:42:07,667][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:42:07,995][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:42:08,329][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:42:09,026][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:42:09,764][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:42:09,766][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:42:09,768][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:42:10,650][__main__][INFO] - Iteration 558 took 24s (41.34% Gen, 54.99% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 26m 36s. Estimated total time: 20h 1m 57s. Time estimates for 10 more iterations: 4m 0s, 100 more iterations: 40m 3s, 500 more iterations: 3h 20m 19s.
[2025-11-13 11:42:10,652][__main__][INFO] - Starting iteration 558.
[2025-11-13 11:42:10,656][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 55 and human policies 1.
[2025-11-13 11:42:10,656][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:42:20,093][__main__][INFO] - Number of regex retries in iteration 558: 0
[2025-11-13 11:42:20,094][__main__][INFO] - agents played in iteration 558 are Bob_buffer, Alice_buffer, Alice, Bob
[2025-11-13 11:42:20,527][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:42:20,568][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:42:20,608][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:42:20,648][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:42:20,648][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:42:20,649][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:42:21,363][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:42:21,664][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:42:22,006][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:42:22,336][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:42:22,667][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:42:23,003][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:42:23,330][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:42:23,659][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:42:23,990][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:42:24,322][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:42:24,646][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:42:24,975][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:42:25,309][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:42:25,644][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:42:25,979][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:42:26,309][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:42:26,636][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:42:26,965][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:42:27,292][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:42:27,618][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:42:27,945][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:42:28,276][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:42:28,602][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:42:28,929][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:42:29,256][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:42:29,583][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:42:29,910][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:42:30,238][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:42:30,564][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:42:30,894][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:42:31,223][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:42:31,550][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:42:31,881][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:42:32,555][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:42:33,270][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:42:33,271][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:42:33,273][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:42:34,152][__main__][INFO] - Iteration 559 took 23s (40.16% Gen, 56.09% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 59m 7s. Estimated total time: 19h 34m 51s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 9s, 500 more iterations: 3h 15m 48s.
[2025-11-13 11:42:34,154][__main__][INFO] - Starting iteration 559.
[2025-11-13 11:42:34,157][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 55 and human policies 1.
[2025-11-13 11:42:34,158][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:42:43,378][__main__][INFO] - Number of regex retries in iteration 559: 0 [2025-11-13 11:42:43,379][__main__][INFO] - agents played in iteration 559 are Bob_buffer, Alice_buffer, Alice, Bob [2025-11-13 11:42:43,813][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:42:43,853][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:42:43,893][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:42:43,933][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:42:43,934][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:42:43,934][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:42:44,686][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:42:44,985][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:42:45,318][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:42:45,649][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:42:45,977][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:42:46,306][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:42:46,634][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:42:46,964][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:42:47,299][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:42:47,631][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:42:47,962][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:42:48,293][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:42:48,623][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:42:48,953][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:42:49,281][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:42:49,613][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:42:49,940][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:42:50,274][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:42:50,608][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:42:50,939][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:42:51,273][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:42:51,603][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:42:51,937][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:42:52,279][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:42:52,614][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:42:52,949][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:42:53,280][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:42:53,616][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:42:53,946][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:42:54,273][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:42:54,605][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:42:54,935][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:42:55,271][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:42:55,983][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:42:56,719][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:42:56,720][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:42:56,722][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:42:57,686][__main__][INFO] - Iteration 560 took 23s (39.19% Gen, 56.70% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 0m 20s. Estimated total time: 19h 36m 28s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 12s, 500 more iterations: 3h 16m 4s. [2025-11-13 11:42:57,688][__main__][INFO] - Starting iteration 560. [2025-11-13 11:42:57,692][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 55 and human policies 1. 
[2025-11-13 11:42:57,692][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:43:07,122][__main__][INFO] - Number of regex retries in iteration 560: 0 [2025-11-13 11:43:07,123][__main__][INFO] - agents played in iteration 560 are Bob_buffer, Alice_buffer, Alice, Bob [2025-11-13 11:43:07,563][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:43:07,604][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:43:07,644][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:43:07,692][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:43:07,693][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:43:07,693][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:43:08,453][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:43:08,753][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:43:09,082][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:43:09,413][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:43:09,743][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:43:10,073][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:43:10,406][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:43:10,739][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:43:11,070][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:43:11,399][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:43:11,725][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:43:12,054][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:43:12,382][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:43:12,717][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:43:13,046][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:43:13,373][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:43:13,709][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:43:14,038][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:43:14,372][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:43:14,710][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:43:15,040][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:43:15,371][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:43:15,701][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:43:16,036][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:43:16,368][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:43:16,699][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:43:17,032][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:43:17,358][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:43:17,687][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:43:18,013][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:43:18,345][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:43:18,675][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:43:19,001][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:43:19,678][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:43:20,400][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:43:20,401][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:43:20,403][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:43:22,129][__main__][INFO] - Iteration 561 took 24s (38.59% Gen, 54.34% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 45m 22s. Estimated total time: 20h 21m 55s. Time estimates for 10 more iterations: 4m 4s, 100 more iterations: 40m 43s, 500 more iterations: 3h 23m 39s. [2025-11-13 11:43:22,131][__main__][INFO] - Starting iteration 561. [2025-11-13 11:43:22,135][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 56 and human policies 1. 
[2025-11-13 11:43:22,135][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:43:32,322][__main__][INFO] - Number of regex retries in iteration 561: 0 [2025-11-13 11:43:32,322][__main__][INFO] - agents played in iteration 561 are Bob_buffer, Alice_buffer, Alice, Bob [2025-11-13 11:43:32,764][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:43:32,805][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:43:32,845][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:43:32,886][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:43:32,886][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:43:32,887][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:43:33,628][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:43:33,935][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:43:34,259][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:43:34,588][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:43:34,915][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:43:35,246][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:43:35,574][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:43:35,902][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:43:36,231][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:43:36,559][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:43:36,888][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:43:37,221][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:43:37,552][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:43:37,883][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:43:38,217][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:43:38,549][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:43:38,877][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:43:39,206][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:43:39,535][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:43:39,863][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:43:40,191][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:43:40,518][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:43:40,849][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:43:41,180][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:43:41,512][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:43:41,845][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:43:42,173][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:43:42,498][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:43:42,824][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:43:43,155][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:43:43,484][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:43:43,815][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:43:44,145][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:43:44,818][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:43:45,536][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:43:45,538][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:43:45,540][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:43:46,466][__main__][INFO] - Iteration 562 took 24s (41.87% Gen, 54.32% Train). Generation: 10s, Training: 13s. Estimated remaining time: 19h 39m 40s. Estimated total time: 20h 16m 36s. Time estimates for 10 more iterations: 4m 3s, 100 more iterations: 40m 33s, 500 more iterations: 3h 22m 46s. [2025-11-13 11:43:46,468][__main__][INFO] - Starting iteration 562. [2025-11-13 11:43:46,471][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 56 and human policies 1. 
[2025-11-13 11:43:46,472][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:43:56,362][__main__][INFO] - Number of regex retries in iteration 562: 0 [2025-11-13 11:43:56,363][__main__][INFO] - agents played in iteration 562 are Bob_buffer, Alice_buffer, Alice, Bob [2025-11-13 11:43:56,812][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:43:56,852][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:43:56,893][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:43:56,933][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:43:56,933][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:43:56,934][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:43:57,662][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:43:57,962][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:43:58,291][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:43:58,620][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:43:58,948][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:43:59,277][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:43:59,606][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:43:59,940][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:44:00,271][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:44:00,598][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:44:00,933][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:44:01,261][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:44:01,590][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:44:01,921][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:44:02,248][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:44:02,576][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:44:02,902][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:44:03,232][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:44:03,560][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:44:03,888][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:44:04,217][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:44:04,550][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:44:04,873][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:44:05,203][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:44:05,530][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:44:05,860][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:44:06,184][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:44:06,513][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:44:06,841][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:44:07,168][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:44:07,497][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:44:07,823][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:44:08,152][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:44:08,826][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:44:09,549][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:44:09,551][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:44:09,552][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:44:10,483][__main__][INFO] - Iteration 563 took 24s (41.19% Gen, 54.93% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 23m 17s. Estimated total time: 20h 0m 37s. Time estimates for 10 more iterations: 4m 0s, 100 more iterations: 40m 1s, 500 more iterations: 3h 20m 6s. [2025-11-13 11:44:10,485][__main__][INFO] - Starting iteration 563. [2025-11-13 11:44:10,488][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 56 and human policies 1. 
[2025-11-13 11:44:10,489][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:44:19,954][__main__][INFO] - Number of regex retries in iteration 563: 0 [2025-11-13 11:44:19,954][__main__][INFO] - agents played in iteration 563 are Bob_buffer, Alice_buffer, Alice, Bob [2025-11-13 11:44:20,388][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:44:20,429][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:44:20,469][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:44:20,510][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:44:20,511][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:44:20,512][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:44:21,244][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:44:21,544][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:44:21,874][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:44:22,203][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:44:22,536][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:44:22,865][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:44:23,193][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:44:23,521][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:44:23,851][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:44:24,180][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:44:24,513][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:44:24,841][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:44:25,169][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:44:25,503][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:44:25,844][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:44:26,173][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:44:26,500][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:44:26,828][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:44:27,155][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:44:27,482][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:44:27,810][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:44:28,139][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:44:28,467][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:44:28,793][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:44:29,121][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:44:29,448][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:44:29,778][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:44:30,107][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:44:30,434][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:44:30,765][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:44:31,092][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:44:31,419][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:44:31,750][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:44:32,456][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:44:33,182][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:44:33,184][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:44:33,185][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:44:34,052][__main__][INFO] - Iteration 564 took 23s (40.16% Gen, 56.15% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 0m 29s. Estimated total time: 19h 38m 13s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 16s, 500 more iterations: 3h 16m 22s. [2025-11-13 11:44:34,053][__main__][INFO] - Starting iteration 564. [2025-11-13 11:44:34,057][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 56 and human policies 1. 
[2025-11-13 11:44:34,058][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:44:44,016][__main__][INFO] - Number of regex retries in iteration 564: 0 [2025-11-13 11:44:44,017][__main__][INFO] - agents played in iteration 564 are Bob_buffer, Alice_buffer, Alice, Bob [2025-11-13 11:44:44,458][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:44:44,498][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:44:44,539][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:44:44,579][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:44:44,580][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:44:44,581][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:44:45,285][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:44:45,588][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:44:45,918][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:44:46,248][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:44:46,580][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:44:46,917][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:44:47,246][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:44:47,575][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:44:47,906][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:44:48,239][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:44:48,574][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:44:48,909][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:44:49,240][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:44:49,571][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:44:49,901][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:44:50,227][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:44:50,555][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:44:50,883][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:44:51,211][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:44:51,539][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:44:51,866][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:44:52,195][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:44:52,523][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:44:52,852][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:44:53,184][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:44:53,513][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:44:53,847][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:44:54,181][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:44:54,507][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:44:54,835][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:44:55,163][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:44:55,491][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:44:55,817][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:44:56,512][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:44:57,236][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:44:57,240][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:44:57,242][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:44:58,154][__main__][INFO] - Iteration 565 took 24s (41.33% Gen, 54.88% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 26m 47s. Estimated total time: 20h 4m 55s. Time estimates for 10 more iterations: 4m 0s, 100 more iterations: 40m 9s, 500 more iterations: 3h 20m 49s.
[2025-11-13 11:44:58,156][__main__][INFO] - Starting iteration 565.
[2025-11-13 11:44:58,159][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 56 and human policies 1.
[2025-11-13 11:44:58,160][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:45:04,778][mllm.models.large_language_model_local][WARNING] - Response bụ did not match regex: (|), retry 1/1
[2025-11-13 11:45:08,736][__main__][INFO] - Number of regex retries in iteration 565: 1
[2025-11-13 11:45:08,736][__main__][INFO] - agents played in iteration 565 are Bob_buffer, Alice_buffer, Alice, Bob
[2025-11-13 11:45:09,167][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:45:09,199][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:45:09,234][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:45:09,282][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:45:09,283][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:45:09,283][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:45:09,976][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:45:10,274][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:45:10,601][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:45:10,928][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:45:11,258][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:45:11,584][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:45:11,914][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:45:12,242][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:45:12,569][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:45:12,895][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:45:13,219][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:45:13,545][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:45:13,870][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:45:14,196][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:45:14,522][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:45:14,856][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:45:15,182][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:45:15,506][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:45:15,833][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:45:16,156][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:45:16,482][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:45:16,810][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:45:17,141][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:45:17,467][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:45:17,791][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:45:18,117][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:45:18,442][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:45:18,770][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:45:19,096][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:45:19,427][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:45:19,761][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:45:20,081][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:45:20,408][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:45:21,116][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:45:21,803][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:45:21,804][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:45:21,806][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:45:22,663][__main__][INFO] - Iteration 566 took 24s (43.16% Gen, 53.34% Train). Generation: 10s, Training: 13s. Estimated remaining time: 19h 46m 42s. Estimated total time: 20h 25m 15s. Time estimates for 10 more iterations: 4m 5s, 100 more iterations: 40m 50s, 500 more iterations: 3h 24m 12s.
[2025-11-13 11:45:22,665][__main__][INFO] - Starting iteration 566.
[2025-11-13 11:45:22,669][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 56 and human policies 1.
[2025-11-13 11:45:22,669][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:45:32,990][__main__][INFO] - Number of regex retries in iteration 566: 0
[2025-11-13 11:45:32,990][__main__][INFO] - agents played in iteration 566 are Bob_buffer, Alice_buffer, Alice, Bob
[2025-11-13 11:45:33,429][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:45:33,462][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:45:33,495][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:45:33,528][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:45:33,528][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:45:33,529][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:45:34,243][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:45:34,540][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:45:34,869][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:45:35,198][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:45:35,526][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:45:35,854][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:45:36,185][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:45:36,512][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:45:36,844][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:45:37,169][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:45:37,499][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:45:37,830][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:45:38,160][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:45:38,483][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:45:38,808][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:45:39,139][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:45:39,466][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:45:39,790][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:45:40,114][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:45:40,444][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:45:40,768][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:45:41,093][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:45:41,417][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:45:41,743][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:45:42,067][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:45:42,397][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:45:42,728][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:45:43,055][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:45:43,386][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:45:43,711][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:45:44,036][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:45:44,366][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:45:44,695][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:45:45,433][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:45:46,135][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:45:46,137][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:45:46,138][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:45:46,938][__main__][INFO] - Iteration 567 took 24s (42.53% Gen, 54.17% Train). Generation: 10s, Training: 13s. Estimated remaining time: 19h 34m 34s. Estimated total time: 20h 13m 31s. Time estimates for 10 more iterations: 4m 2s, 100 more iterations: 40m 27s, 500 more iterations: 3h 22m 15s.
[2025-11-13 11:45:46,940][__main__][INFO] - Starting iteration 567.
[2025-11-13 11:45:46,943][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 56 and human policies 1.
[2025-11-13 11:45:46,943][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:45:56,653][__main__][INFO] - Number of regex retries in iteration 567: 0
[2025-11-13 11:45:56,654][__main__][INFO] - agents played in iteration 567 are Bob_buffer, Alice_buffer, Alice, Bob
[2025-11-13 11:45:57,082][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:45:57,117][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:45:57,150][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:45:57,183][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:45:57,184][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:45:57,184][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:45:57,883][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:45:58,179][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:45:58,505][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:45:58,831][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:45:59,157][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:45:59,484][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:45:59,811][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:46:00,140][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:46:00,466][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:46:00,793][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:46:01,125][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:46:01,450][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:46:01,775][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:46:02,104][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:46:02,435][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:46:02,765][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:46:03,092][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:46:03,419][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:46:03,743][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:46:04,075][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:46:04,403][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:46:04,729][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:46:05,055][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:46:05,382][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:46:05,706][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:46:06,033][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:46:06,360][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:46:06,687][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:46:07,018][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:46:07,347][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:46:07,672][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:46:07,996][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:46:08,327][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:46:09,042][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:46:09,758][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:46:09,760][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:46:09,762][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:46:10,561][__main__][INFO] - Iteration 568 took 23s (41.11% Gen, 55.50% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 1m 35s. Estimated total time: 19h 40m 56s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 21s, 500 more iterations: 3h 16m 49s.
[2025-11-13 11:46:10,563][__main__][INFO] - Starting iteration 568.
[2025-11-13 11:46:10,566][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 56 and human policies 1.
[2025-11-13 11:46:10,566][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:46:20,667][__main__][INFO] - Number of regex retries in iteration 568: 0
[2025-11-13 11:46:20,668][__main__][INFO] - agents played in iteration 568 are Bob_buffer, Alice_buffer, Alice, Bob
[2025-11-13 11:46:21,105][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:46:21,138][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:46:21,171][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:46:21,203][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:46:21,204][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:46:21,204][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:46:21,902][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:46:22,199][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:46:22,525][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:46:22,851][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:46:23,180][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:46:23,506][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:46:23,830][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:46:24,158][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:46:24,486][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:46:24,812][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:46:25,136][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:46:25,464][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:46:25,789][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:46:26,115][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:46:26,440][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:46:26,766][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:46:27,094][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:46:27,419][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:46:27,745][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:46:28,070][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:46:28,395][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:46:28,720][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:46:29,046][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:46:29,370][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:46:29,694][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:46:30,019][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:46:30,344][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:46:30,670][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:46:30,998][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:46:31,323][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:46:31,649][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:46:31,976][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:46:32,301][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:46:33,007][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:46:33,689][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:46:33,692][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:46:33,693][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:46:34,537][__main__][INFO] - Iteration 569 took 23s (42.14% Gen, 54.33% Train). Generation: 10s, Training: 13s. Estimated remaining time: 19h 18m 52s. Estimated total time: 19h 58m 37s. Time estimates for 10 more iterations: 3m 59s, 100 more iterations: 39m 57s, 500 more iterations: 3h 19m 46s.
[2025-11-13 11:46:34,539][__main__][INFO] - Starting iteration 569.
[2025-11-13 11:46:34,542][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 56 and human policies 1.
[2025-11-13 11:46:34,542][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:46:43,766][__main__][INFO] - Number of regex retries in iteration 569: 0
[2025-11-13 11:46:43,767][__main__][INFO] - agents played in iteration 569 are Bob_buffer, Alice_buffer, Alice, Bob
[2025-11-13 11:46:44,207][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:46:44,240][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:46:44,273][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:46:44,306][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:46:44,306][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:46:44,306][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:46:45,006][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:46:45,303][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:46:45,629][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:46:45,955][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:46:46,280][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:46:46,605][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:46:46,931][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:46:47,258][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:46:47,582][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:46:47,909][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:46:48,233][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:46:48,557][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:46:48,884][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:46:49,209][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:46:49,535][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:46:49,860][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:46:50,183][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:46:50,508][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:46:50,832][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:46:51,162][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:46:51,491][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:46:51,816][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:46:52,143][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:46:52,467][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:46:52,792][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:46:53,119][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:46:53,443][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:46:53,771][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:46:54,097][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:46:54,423][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:46:54,747][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:46:55,073][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:46:55,401][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:46:56,104][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:46:56,792][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:46:56,794][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:46:56,796][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:46:57,623][__main__][INFO] - Iteration 570 took 23s (39.96% Gen, 56.45% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 33m 57s. Estimated total time: 19h 14m 5s. Time estimates for 10 more iterations: 3m 50s, 100 more iterations: 38m 28s, 500 more iterations: 3h 12m 20s.
[2025-11-13 11:46:57,625][__main__][INFO] - Starting iteration 570.
[2025-11-13 11:46:57,628][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 56 and human policies 1.
[2025-11-13 11:46:57,628][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:47:07,387][__main__][INFO] - Number of regex retries in iteration 570: 0 [2025-11-13 11:47:07,388][__main__][INFO] - agents played in iteration 570 are Bob_buffer, Alice_buffer, Alice, Bob [2025-11-13 11:47:07,823][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:47:07,856][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:47:07,888][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:47:07,921][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:47:07,922][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:47:07,922][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:47:08,618][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:47:08,917][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:47:09,244][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:47:09,571][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:47:09,897][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:47:10,225][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:47:10,562][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:47:10,888][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:47:11,216][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:47:11,542][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:47:11,874][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:47:12,200][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:47:12,526][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:47:12,853][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:47:13,179][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:47:13,504][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:47:13,828][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:47:14,157][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:47:14,477][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:47:14,802][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:47:15,127][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 11:47:15,458][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:47:15,779][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:47:16,105][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:47:16,431][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:47:16,760][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:47:17,083][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:47:17,408][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:47:17,734][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:47:18,061][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:47:18,385][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:47:18,711][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:47:19,038][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 11:47:19,752][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:47:20,436][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:47:20,437][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:47:20,438][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:47:22,019][__main__][INFO] - Iteration 571 took 24s (40.01% Gen, 53.50% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 39m 4s. Estimated total time: 20h 19m 36s. Time estimates for 10 more iterations: 4m 3s, 100 more iterations: 40m 39s, 500 more iterations: 3h 23m 16s.
[2025-11-13 11:47:22,021][__main__][INFO] - Starting iteration 571.
[2025-11-13 11:47:22,025][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 57 and human policies 1.
[2025-11-13 11:47:22,026][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:47:31,548][__main__][INFO] - Number of regex retries in iteration 571: 0
[2025-11-13 11:47:31,548][__main__][INFO] - agents played in iteration 571 are Bob_buffer, Alice_buffer, Alice, Bob
[2025-11-13 11:47:31,974][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:47:32,007][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:47:32,040][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:47:32,072][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:47:32,073][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:47:32,073][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:47:32,773][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:47:33,071][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:47:33,399][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:47:33,725][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:47:34,054][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:47:34,380][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:47:34,707][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:47:35,034][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:47:35,360][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:47:35,685][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:47:36,012][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:47:36,339][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:47:36,666][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:47:36,997][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:47:37,321][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:47:37,647][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:47:37,973][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:47:38,298][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:47:38,627][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:47:38,958][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:47:39,282][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:47:39,607][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:47:39,933][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:47:40,258][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:47:40,590][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:47:40,920][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:47:41,247][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:47:41,575][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:47:41,900][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:47:42,227][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:47:42,553][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:47:42,884][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:47:43,210][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:47:43,934][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:47:44,635][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:47:44,638][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:47:44,640][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:47:45,433][__main__][INFO] - Iteration 572 took 23s (40.67% Gen, 55.92% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 49m 32s. Estimated total time: 19h 30m 28s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 0s, 500 more iterations: 3h 15m 4s.
[2025-11-13 11:47:45,435][__main__][INFO] - Starting iteration 572.
[2025-11-13 11:47:45,439][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 57 and human policies 1.
[2025-11-13 11:47:45,439][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:47:54,572][__main__][INFO] - Number of regex retries in iteration 572: 0
[2025-11-13 11:47:54,572][__main__][INFO] - agents played in iteration 572 are Bob_buffer, Alice_buffer, Alice, Bob
[2025-11-13 11:47:55,006][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:47:55,039][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:47:55,072][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:47:55,105][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:47:55,105][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:47:55,105][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:47:55,799][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:47:56,095][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:47:56,428][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:47:56,756][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:47:57,085][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:47:57,417][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:47:57,751][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:47:58,079][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:47:58,405][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:47:58,734][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:47:59,067][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:47:59,397][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:47:59,725][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:48:00,052][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:48:00,380][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:48:00,704][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:48:01,028][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:48:01,353][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:48:01,682][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:48:02,009][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:48:02,334][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:48:02,661][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:48:02,988][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:48:03,312][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:48:03,639][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:48:03,966][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:48:04,294][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:48:04,620][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:48:04,945][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:48:05,271][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:48:05,599][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:48:05,924][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:48:06,250][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:48:06,963][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:48:07,648][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:48:07,649][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:48:07,651][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:48:08,475][__main__][INFO] - Iteration 573 took 23s (39.64% Gen, 56.77% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 30m 34s. Estimated total time: 19h 11m 52s. Time estimates for 10 more iterations: 3m 50s, 100 more iterations: 38m 23s, 500 more iterations: 3h 11m 58s.
[2025-11-13 11:48:08,477][__main__][INFO] - Starting iteration 573.
[2025-11-13 11:48:08,480][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 57 and human policies 1.
[2025-11-13 11:48:08,481][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:48:18,120][__main__][INFO] - Number of regex retries in iteration 573: 0
[2025-11-13 11:48:18,120][__main__][INFO] - agents played in iteration 573 are Bob_buffer, Alice_buffer, Alice, Bob
[2025-11-13 11:48:18,567][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:48:18,600][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:48:18,633][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:48:18,667][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:48:18,667][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:48:18,668][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:48:19,362][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:48:19,661][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:48:19,988][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:48:20,316][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:48:20,643][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:48:20,970][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:48:21,295][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:48:21,628][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:48:21,955][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:48:22,286][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:48:22,616][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:48:22,944][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:48:23,277][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:48:23,606][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:48:23,935][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:48:24,267][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:48:24,594][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:48:24,923][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:48:25,249][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:48:25,578][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:48:25,903][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:48:26,230][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:48:26,556][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:48:26,881][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:48:27,208][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:48:27,534][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:48:27,863][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:48:28,188][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:48:28,513][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:48:28,841][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:48:29,168][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:48:29,495][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:48:29,821][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:48:30,528][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:48:31,211][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:48:31,213][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:48:31,215][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:48:32,008][__main__][INFO] - Iteration 574 took 23s (40.97% Gen, 55.65% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 54m 44s. Estimated total time: 19h 36m 27s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 12s, 500 more iterations: 3h 16m 4s.
[2025-11-13 11:48:32,010][__main__][INFO] - Starting iteration 574.
[2025-11-13 11:48:32,013][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 57 and human policies 1.
[2025-11-13 11:48:32,013][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:48:40,474][__main__][INFO] - Number of regex retries in iteration 574: 0
[2025-11-13 11:48:40,475][__main__][INFO] - agents played in iteration 574 are Bob_buffer, Alice_buffer, Alice, Bob
[2025-11-13 11:48:40,931][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:48:40,964][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:48:40,997][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:48:41,030][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:48:41,030][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:48:41,031][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:48:41,739][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:48:42,038][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:48:42,371][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:48:42,699][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:48:43,026][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:48:43,352][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:48:43,682][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:48:44,013][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:48:44,341][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:48:44,672][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:48:45,000][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:48:45,329][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:48:45,662][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:48:45,994][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:48:46,321][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:48:46,653][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:48:46,980][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:48:47,309][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:48:47,636][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:48:47,967][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:48:48,294][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:48:48,621][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:48:48,951][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:48:49,279][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:48:49,607][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:48:49,937][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:48:50,269][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:48:50,596][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:48:50,922][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:48:51,251][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:48:51,585][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:48:51,912][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:48:52,238][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:48:52,940][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:48:53,623][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:48:53,625][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:48:53,627][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:48:54,503][__main__][INFO] - Iteration 575 took 22s (37.62% Gen, 58.48% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 2m 27s. Estimated total time: 18h 44m 32s. Time estimates for 10 more iterations: 3m 44s, 100 more iterations: 37m 29s, 500 more iterations: 3h 7m 25s.
[2025-11-13 11:48:54,505][__main__][INFO] - Starting iteration 575.
[2025-11-13 11:48:54,508][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 57 and human policies 1.
[2025-11-13 11:48:54,509][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:49:03,558][__main__][INFO] - Number of regex retries in iteration 575: 0
[2025-11-13 11:49:03,559][__main__][INFO] - agents played in iteration 575 are Bob_buffer, Alice_buffer, Alice, Bob
[2025-11-13 11:49:03,994][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:49:04,027][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:49:04,060][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:49:04,093][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:49:04,094][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:49:04,095][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:49:04,815][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:49:05,113][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:49:05,441][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:49:05,765][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:49:06,095][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:49:06,420][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:49:06,748][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:49:07,075][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:49:07,400][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:49:07,725][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:49:08,051][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:49:08,380][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:49:08,704][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:49:09,034][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:49:09,362][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:49:09,692][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:49:10,018][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:49:10,344][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:49:10,668][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:49:10,997][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:49:11,325][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:49:11,655][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:49:11,980][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:49:12,307][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:49:12,634][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:49:12,962][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:49:13,287][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:49:13,614][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:49:13,942][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:49:14,276][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:49:14,600][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:49:14,926][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:49:15,250][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:49:15,952][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:49:16,678][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:49:16,680][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:49:16,682][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:49:17,541][__main__][INFO] - Iteration 576 took 23s (39.29% Gen, 56.97% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 29m 15s. Estimated total time: 19h 11m 43s. Time estimates for 10 more iterations: 3m 50s, 100 more iterations: 38m 23s, 500 more iterations: 3h 11m 57s.
[2025-11-13 11:49:17,543][__main__][INFO] - Starting iteration 576.
[2025-11-13 11:49:17,546][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 57 and human policies 1.
[2025-11-13 11:49:17,547][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:49:26,234][__main__][INFO] - Number of regex retries in iteration 576: 0
[2025-11-13 11:49:26,235][__main__][INFO] - agents played in iteration 576 are Bob_buffer, Alice_buffer, Alice, Bob
[2025-11-13 11:49:26,671][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:49:26,704][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:49:26,737][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:49:26,771][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:49:26,771][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:49:26,772][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:49:27,506][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:49:27,805][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:49:28,131][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:49:28,457][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:49:28,782][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:49:29,111][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:49:29,436][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:49:29,766][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:49:30,098][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:49:30,424][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:49:30,752][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:49:31,083][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:49:31,413][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:49:31,739][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:49:32,067][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:49:32,401][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:49:32,725][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:49:33,051][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:49:33,378][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:49:33,710][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:49:34,029][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:49:34,357][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:49:34,682][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:49:35,008][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:49:35,333][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:49:35,657][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:49:35,982][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:49:36,314][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:49:36,634][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:49:36,961][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:49:37,289][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:49:37,614][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:49:37,939][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:49:38,748][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:49:39,456][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:49:39,457][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:49:39,459][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:49:40,330][__main__][INFO] - Iteration 577 took 22s (38.13% Gen, 58.04% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 16m 22s. Estimated total time: 18h 59m 12s. Time estimates for 10 more iterations: 3m 47s, 100 more iterations: 37m 58s, 500 more iterations: 3h 9m 52s.
[2025-11-13 11:49:40,332][__main__][INFO] - Starting iteration 577.
[2025-11-13 11:49:40,335][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 57 and human policies 1.
[2025-11-13 11:49:40,336][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:49:49,456][__main__][INFO] - Number of regex retries in iteration 577: 0
[2025-11-13 11:49:49,457][__main__][INFO] - agents played in iteration 577 are Bob_buffer, Alice_buffer, Alice, Bob
[2025-11-13 11:49:49,906][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:49:49,939][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:49:49,972][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:49:50,006][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:49:50,007][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:49:50,007][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:49:50,721][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:49:51,020][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:49:51,347][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:49:51,675][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:49:52,002][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:49:52,328][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:49:52,657][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:49:52,982][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:49:53,308][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:49:53,634][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:49:53,962][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:49:54,288][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:49:54,615][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:49:54,942][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:49:55,271][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:49:55,598][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:49:55,925][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:49:56,255][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:49:56,580][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:49:56,910][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:49:57,237][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:49:57,565][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:49:57,892][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:49:58,222][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:49:58,547][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:49:58,874][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:49:59,201][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:49:59,528][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:49:59,854][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:50:00,182][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:50:00,511][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:50:00,837][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:50:01,161][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:50:01,869][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:50:02,575][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:50:02,577][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:50:02,579][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:50:03,448][__main__][INFO] - Iteration 578 took 23s (39.46% Gen, 56.77% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 32m 27s. Estimated total time: 19h 15m 40s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 31s, 500 more iterations: 3h 12m 36s.
[2025-11-13 11:50:03,450][__main__][INFO] - Starting iteration 578.
[2025-11-13 11:50:03,453][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 57 and human policies 1.
[2025-11-13 11:50:03,453][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:50:12,073][__main__][INFO] - Number of regex retries in iteration 578: 0
[2025-11-13 11:50:12,074][__main__][INFO] - agents played in iteration 578 are Bob_buffer, Alice_buffer, Alice, Bob
[2025-11-13 11:50:12,511][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:50:12,544][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:50:12,577][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:50:12,610][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:50:12,611][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:50:12,612][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:50:13,334][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:50:13,632][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:50:13,959][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:50:14,285][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:50:14,619][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:50:14,939][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:50:15,266][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:50:15,591][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:50:15,923][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:50:16,251][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:50:16,583][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:50:16,910][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:50:17,237][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:50:17,564][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:50:17,889][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:50:18,214][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:50:18,540][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:50:18,866][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:50:19,196][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:50:19,524][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:50:19,852][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:50:20,181][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:50:20,506][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:50:20,835][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:50:21,162][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:50:21,500][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:50:21,827][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:50:22,152][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:50:22,478][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:50:22,803][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:50:23,128][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:50:23,453][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:50:23,780][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:50:24,475][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:50:25,183][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:50:25,185][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:50:25,187][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:50:26,003][__main__][INFO] - Iteration 579 took 22s (38.22% Gen, 58.15% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 3m 57s. Estimated total time: 18h 47m 33s. Time estimates for 10 more iterations: 3m 45s, 100 more iterations: 37m 35s, 500 more iterations: 3h 7m 55s.
[2025-11-13 11:50:26,007][__main__][INFO] - Starting iteration 579.
[2025-11-13 11:50:26,010][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 57 and human policies 1.
[2025-11-13 11:50:26,010][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:50:35,282][__main__][INFO] - Number of regex retries in iteration 579: 0
[2025-11-13 11:50:35,283][__main__][INFO] - agents played in iteration 579 are Bob_buffer, Alice_buffer, Alice, Bob
[2025-11-13 11:50:35,736][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:50:35,769][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:50:35,802][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:50:35,836][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:50:35,836][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:50:35,836][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:50:36,555][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:50:36,853][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:50:37,181][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:50:37,507][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:50:37,831][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:50:38,158][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:50:38,496][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:50:38,822][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:50:39,147][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:50:39,476][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:50:39,808][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:50:40,137][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:50:40,462][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:50:40,788][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:50:41,117][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:50:41,448][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:50:41,774][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:50:42,098][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:50:42,430][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:50:42,756][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:50:43,082][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:50:43,417][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:50:43,737][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:50:44,062][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:50:44,390][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:50:44,715][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:50:45,040][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:50:45,366][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:50:45,693][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:50:46,020][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:50:46,348][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:50:46,676][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:50:47,006][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:50:47,716][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:50:48,397][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:50:48,400][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:50:48,401][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:50:49,230][__main__][INFO] - Iteration 580 took 23s (39.93% Gen, 56.49% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 37m 5s. Estimated total time: 19h 21m 4s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 42s, 500 more iterations: 3h 13m 30s.
[2025-11-13 11:50:49,232][__main__][INFO] - Starting iteration 580.
[2025-11-13 11:50:49,236][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 57 and human policies 1.
[2025-11-13 11:50:49,236][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:50:57,894][__main__][INFO] - Number of regex retries in iteration 580: 0
[2025-11-13 11:50:57,895][__main__][INFO] - agents played in iteration 580 are Bob_buffer, Alice_buffer, Alice, Bob
[2025-11-13 11:50:58,344][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:50:58,377][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:50:58,410][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:50:58,443][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:50:58,444][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:50:58,444][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:50:59,180][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:50:59,478][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:50:59,813][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:51:00,138][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:51:00,465][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:51:00,792][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:51:01,121][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:51:01,446][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:51:01,773][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:51:02,107][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:51:02,428][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:51:02,755][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:51:03,083][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:51:03,412][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:51:03,733][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:51:04,057][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:51:04,383][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:51:04,717][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:51:05,037][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:51:05,363][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:51:05,691][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:51:06,025][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:51:06,348][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:51:06,681][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:51:07,008][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:51:07,337][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:51:07,662][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:51:07,989][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:51:08,315][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:51:08,643][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:51:08,970][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:51:09,296][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:51:09,621][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:51:10,319][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:51:11,004][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:51:11,006][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:51:11,007][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:51:12,795][__main__][INFO] - Iteration 581 took 23s (36.75% Gen, 55.66% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 53m 37s. Estimated total time: 19h 38m 0s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 16s, 500 more iterations: 3h 16m 20s.
[2025-11-13 11:51:12,797][__main__][INFO] - Starting iteration 581.
[2025-11-13 11:51:12,801][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 58 and human policies 1.
[2025-11-13 11:51:12,801][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:51:21,526][__main__][INFO] - Number of regex retries in iteration 581: 0
[2025-11-13 11:51:21,526][__main__][INFO] - agents played in iteration 581 are Bob_buffer, Alice_buffer, Alice, Bob
[2025-11-13 11:51:21,984][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:51:22,017][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:51:22,050][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:51:22,083][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:51:22,084][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:51:22,084][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:51:22,812][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:51:23,112][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:51:23,439][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:51:23,764][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:51:24,090][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:51:24,417][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:51:24,742][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:51:25,069][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:51:25,394][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:51:25,722][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:51:26,049][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:51:26,375][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:51:26,705][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:51:27,038][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:51:27,361][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:51:27,687][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:51:28,012][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:51:28,340][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:51:28,666][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:51:28,997][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:51:29,322][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:51:29,647][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:51:29,976][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:51:30,302][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:51:30,628][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:51:30,958][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:51:31,293][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:51:31,620][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:51:31,948][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:51:32,273][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:51:32,613][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:51:32,938][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:51:33,266][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:51:33,975][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:51:34,689][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:51:34,691][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:51:34,693][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:51:35,608][__main__][INFO] - Iteration 582 took 22s (38.25% Gen, 57.72% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 15m 39s. Estimated total time: 19h 0m 25s. Time estimates for 10 more iterations: 3m 48s, 100 more iterations: 38m 0s, 500 more iterations: 3h 10m 4s.
[2025-11-13 11:51:35,612][__main__][INFO] - Starting iteration 582.
[2025-11-13 11:51:35,615][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 58 and human policies 1.
[2025-11-13 11:51:35,615][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:51:45,254][__main__][INFO] - Number of regex retries in iteration 582: 0
[2025-11-13 11:51:45,254][__main__][INFO] - agents played in iteration 582 are Bob_buffer, Alice_buffer, Alice, Bob
[2025-11-13 11:51:45,729][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:51:45,762][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:51:45,795][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:51:45,829][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:51:45,829][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:51:45,830][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:51:46,544][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:51:46,845][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:51:47,170][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:51:47,496][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:51:47,822][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:51:48,146][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:51:48,474][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:51:48,800][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:51:49,128][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:51:49,458][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:51:49,795][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:51:50,120][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:51:50,447][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:51:50,774][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:51:51,100][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:51:51,429][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:51:51,758][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:51:52,086][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:51:52,418][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:51:52,746][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:51:53,074][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:51:53,405][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:51:53,725][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:51:54,052][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:51:54,378][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:51:54,706][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:51:55,033][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:51:55,359][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:51:55,686][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:51:56,011][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:51:56,337][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:51:56,662][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:51:56,987][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:51:57,686][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:51:58,403][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:51:58,405][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:51:58,406][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:51:59,269][__main__][INFO] - Iteration 583 took 23s (40.75% Gen, 55.60% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 57m 35s. Estimated total time: 19h 42m 44s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 25s, 500 more iterations: 3h 17m 7s.
[2025-11-13 11:51:59,271][__main__][INFO] - Starting iteration 583.
[2025-11-13 11:51:59,274][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 58 and human policies 1.
[2025-11-13 11:51:59,275][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:52:08,123][__main__][INFO] - Number of regex retries in iteration 583: 0
[2025-11-13 11:52:08,124][__main__][INFO] - agents played in iteration 583 are Bob_buffer, Alice_buffer, Alice, Bob
[2025-11-13 11:52:08,584][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:52:08,617][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:52:08,650][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:52:08,683][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:52:08,684][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:52:08,684][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:52:09,396][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:52:09,695][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:52:10,022][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:52:10,349][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:52:10,676][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:52:11,015][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:52:11,341][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:52:11,670][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:52:11,996][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:52:12,327][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:52:12,653][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:52:12,982][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:52:13,317][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:52:13,644][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:52:13,972][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:52:14,299][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:52:14,636][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:52:14,961][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:52:15,290][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:52:15,616][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:52:15,942][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:52:16,268][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:52:16,595][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:52:16,921][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:52:17,246][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:52:17,572][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:52:17,898][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:52:18,222][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:52:18,548][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:52:18,872][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:52:19,199][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:52:19,524][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:52:19,852][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:52:20,550][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:52:21,273][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:52:21,275][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:52:21,277][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:52:22,143][__main__][INFO] - Iteration 584 took 22s (38.69% Gen, 57.51% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 17m 55s. Estimated total time: 19h 3m 28s. Time estimates for 10 more iterations: 3m 48s, 100 more iterations: 38m 6s, 500 more iterations: 3h 10m 34s.
[2025-11-13 11:52:22,145][__main__][INFO] - Starting iteration 584.
[2025-11-13 11:52:22,148][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 58 and human policies 1.
[2025-11-13 11:52:22,148][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:52:31,117][__main__][INFO] - Number of regex retries in iteration 584: 0
[2025-11-13 11:52:31,118][__main__][INFO] - agents played in iteration 584 are Bob_buffer, Alice_buffer, Alice, Bob
[2025-11-13 11:52:31,568][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:52:31,601][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:52:31,650][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:52:31,683][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:52:31,684][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:52:31,685][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:52:32,431][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:52:32,729][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:52:33,058][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:52:33,382][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:52:33,710][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:52:34,038][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:52:34,365][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:52:34,692][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:52:35,020][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:52:35,344][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:52:35,670][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:52:36,001][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:52:36,327][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:52:36,652][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:52:36,984][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:52:37,317][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:52:37,646][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:52:37,973][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:52:38,299][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:52:38,631][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:52:38,957][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:52:39,281][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:52:39,606][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:52:39,933][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:52:40,258][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:52:40,584][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:52:40,909][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:52:41,235][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:52:41,560][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:52:41,886][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:52:42,213][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:52:42,540][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:52:42,864][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:52:43,551][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:52:44,256][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:52:44,257][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:52:44,259][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:52:45,126][__main__][INFO] - Iteration 585 took 22s (39.03% Gen, 57.18% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 23m 2s. Estimated total time: 19h 8m 58s. Time estimates for 10 more iterations: 3m 49s, 100 more iterations: 38m 17s, 500 more iterations: 3h 11m 29s.
[2025-11-13 11:52:45,128][__main__][INFO] - Starting iteration 585.
[2025-11-13 11:52:45,132][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 58 and human policies 1.
[2025-11-13 11:52:45,133][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:52:53,570][__main__][INFO] - Number of regex retries in iteration 585: 0
[2025-11-13 11:52:53,570][__main__][INFO] - agents played in iteration 585 are Bob_buffer, Alice_buffer, Alice, Bob
[2025-11-13 11:52:54,028][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:52:54,062][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:52:54,096][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:52:54,130][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:52:54,131][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:52:54,131][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:52:54,850][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:52:55,155][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:52:55,474][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:52:55,803][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:52:56,130][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:52:56,463][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:52:56,782][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:52:57,109][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:52:57,436][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:52:57,762][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:52:58,089][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:52:58,416][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:52:58,743][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:52:59,071][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:52:59,396][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:52:59,726][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:53:00,052][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:53:00,380][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:53:00,709][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:53:01,039][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:53:01,365][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:53:01,694][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:53:02,020][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:53:02,346][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:53:02,672][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:53:02,999][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:53:03,327][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:53:03,652][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:53:03,978][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:53:04,306][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:53:04,634][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:53:04,958][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:53:05,285][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:53:05,977][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:53:06,683][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:53:06,685][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:53:06,686][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:53:07,540][__main__][INFO] - Iteration 586 took 22s (37.65% Gen, 58.53% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 54m 9s. Estimated total time: 18h 40m 27s. Time estimates for 10 more iterations: 3m 44s, 100 more iterations: 37m 20s, 500 more iterations: 3h 6m 44s.
[2025-11-13 11:53:07,542][__main__][INFO] - Starting iteration 586.
[2025-11-13 11:53:07,545][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 58 and human policies 1.
[2025-11-13 11:53:07,546][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:53:12,773][mllm.models.large_language_model_local][WARNING] - Response did not match regex: (|), retry 1/1
[2025-11-13 11:53:16,652][__main__][INFO] - Number of regex retries in iteration 586: 1
[2025-11-13 11:53:16,652][__main__][INFO] - agents played in iteration 586 are Bob_buffer, Alice_buffer, Alice, Bob
[2025-11-13 11:53:17,095][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:53:17,127][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:53:17,162][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:53:17,196][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:53:17,196][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:53:17,197][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:53:17,920][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:53:18,218][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:53:18,545][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:53:18,872][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:53:19,197][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:53:19,521][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:53:19,849][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:53:20,174][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:53:20,500][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:53:20,826][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:53:21,153][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:53:21,482][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:53:21,819][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:53:22,146][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:53:22,473][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:53:22,805][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:53:23,139][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:53:23,468][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:53:23,794][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:53:24,119][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:53:24,447][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:53:24,775][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:53:25,100][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:53:25,429][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:53:25,750][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:53:26,076][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:53:26,403][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:53:26,733][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:53:27,055][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:53:27,383][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:53:27,711][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:53:28,042][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:53:28,365][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:53:29,046][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:53:29,775][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:53:29,777][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:53:29,778][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:53:30,889][__main__][INFO] - Iteration 587 took 23s (39.01% Gen, 56.22% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 40m 33s. Estimated total time: 19h 27m 14s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 54s, 500 more iterations: 3h 14m 32s. [2025-11-13 11:53:30,891][__main__][INFO] - Starting iteration 587. [2025-11-13 11:53:30,894][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 58 and human policies 1. 
[2025-11-13 11:53:30,894][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:53:39,898][__main__][INFO] - Number of regex retries in iteration 587: 0 [2025-11-13 11:53:39,898][__main__][INFO] - agents played in iteration 587 are Bob_buffer, Alice_buffer, Alice, Bob [2025-11-13 11:53:40,368][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:53:40,401][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:53:40,434][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:53:40,468][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:53:40,468][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:53:40,469][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:53:41,201][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:53:41,497][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:53:41,824][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:53:42,151][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:53:42,477][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:53:42,803][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:53:43,128][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:53:43,455][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:53:43,781][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:53:44,104][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:53:44,432][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:53:44,759][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:53:45,086][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:53:45,412][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:53:45,739][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:53:46,066][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:53:46,391][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:53:46,722][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:53:47,048][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:53:47,375][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:53:47,702][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:53:48,031][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:53:48,359][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:53:48,685][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:53:49,012][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:53:49,343][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:53:49,670][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:53:49,997][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:53:50,322][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:53:50,650][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:53:50,978][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:53:51,303][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:53:51,628][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:53:52,351][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:53:53,061][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:53:53,063][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:53:53,064][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:53:53,940][__main__][INFO] - Iteration 588 took 23s (39.07% Gen, 57.12% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 25m 17s. Estimated total time: 19h 12m 21s. Time estimates for 10 more iterations: 3m 50s, 100 more iterations: 38m 24s, 500 more iterations: 3h 12m 3s. [2025-11-13 11:53:53,942][__main__][INFO] - Starting iteration 588. [2025-11-13 11:53:53,946][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 58 and human policies 1. 
[2025-11-13 11:53:53,947][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:54:03,252][__main__][INFO] - Number of regex retries in iteration 588: 0 [2025-11-13 11:54:03,252][__main__][INFO] - agents played in iteration 588 are Bob_buffer, Alice_buffer, Alice, Bob [2025-11-13 11:54:03,722][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:54:03,755][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:54:03,788][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:54:03,822][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:54:03,822][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:54:03,822][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:54:04,538][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:54:04,835][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:54:05,161][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:54:05,486][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:54:05,813][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:54:06,138][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:54:06,464][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:54:06,789][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:54:07,116][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:54:07,441][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:54:07,767][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:54:08,093][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:54:08,420][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:54:08,745][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:54:09,074][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:54:09,399][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:54:09,726][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:54:10,053][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:54:10,382][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:54:10,708][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:54:11,035][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:54:11,362][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:54:11,690][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:54:12,015][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:54:12,342][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:54:12,668][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:54:12,993][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:54:13,320][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:54:13,649][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:54:13,976][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:54:14,302][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:54:14,626][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:54:14,953][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:54:15,667][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:54:16,387][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:54:16,388][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:54:16,390][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:54:17,404][__main__][INFO] - Iteration 589 took 23s (39.67% Gen, 56.00% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 45m 31s. Estimated total time: 19h 32m 58s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 5s, 500 more iterations: 3h 15m 29s. [2025-11-13 11:54:17,407][__main__][INFO] - Starting iteration 589. [2025-11-13 11:54:17,410][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 58 and human policies 1. 
[2025-11-13 11:54:17,410][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:54:26,630][__main__][INFO] - Number of regex retries in iteration 589: 0 [2025-11-13 11:54:26,631][__main__][INFO] - agents played in iteration 589 are Bob_buffer, Alice_buffer, Alice, Bob [2025-11-13 11:54:27,071][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:54:27,104][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:54:27,137][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:54:27,171][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:54:27,171][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:54:27,172][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:54:27,935][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:54:28,233][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:54:28,559][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:54:28,884][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:54:29,211][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:54:29,537][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:54:29,863][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:54:30,188][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:54:30,515][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:54:30,843][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:54:31,171][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:54:31,496][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:54:31,822][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:54:32,149][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:54:32,476][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:54:32,803][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:54:33,131][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:54:33,458][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:54:33,783][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:54:34,110][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:54:34,442][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:54:34,770][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:54:35,095][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:54:35,423][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:54:35,749][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:54:36,076][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:54:36,401][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:54:36,726][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:54:37,054][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:54:37,380][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:54:37,706][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:54:38,033][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:54:38,360][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:54:39,048][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:54:39,766][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:54:39,768][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:54:39,769][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:54:40,651][__main__][INFO] - Iteration 590 took 23s (39.67% Gen, 56.53% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 34m 16s. Estimated total time: 19h 22m 7s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 44s, 500 more iterations: 3h 13m 41s. [2025-11-13 11:54:40,653][__main__][INFO] - Starting iteration 590. [2025-11-13 11:54:40,657][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 58 and human policies 1. 
[2025-11-13 11:54:40,657][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:54:49,638][__main__][INFO] - Number of regex retries in iteration 590: 0 [2025-11-13 11:54:49,639][__main__][INFO] - agents played in iteration 590 are Bob_buffer, Alice_buffer, Alice, Bob [2025-11-13 11:54:50,082][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:54:50,115][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:54:50,148][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:54:50,181][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:54:50,182][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:54:50,183][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:54:50,907][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:54:51,205][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:54:51,531][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:54:51,857][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:54:52,184][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:54:52,514][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:54:52,839][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:54:53,166][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:54:53,493][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:54:53,819][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:54:54,145][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:54:54,470][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:54:54,795][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:54:55,129][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:54:55,455][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:54:55,782][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:54:56,108][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:54:56,446][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:54:56,771][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:54:57,097][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:54:57,423][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:54:57,764][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:54:58,091][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:54:58,417][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:54:58,742][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:54:59,071][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:54:59,396][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:54:59,723][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:55:00,050][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:55:00,376][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:55:00,702][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:55:01,028][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:55:01,353][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:55:02,052][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:55:02,772][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:55:02,773][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:55:02,775][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:55:04,508][__main__][INFO] - Iteration 591 took 23s (37.65% Gen, 55.07% Train). Generation: 8s, Training: 13s. Estimated remaining time: 19h 4m 22s. Estimated total time: 19h 52m 37s. Time estimates for 10 more iterations: 3m 58s, 100 more iterations: 39m 45s, 500 more iterations: 3h 18m 46s. [2025-11-13 11:55:04,511][__main__][INFO] - Starting iteration 591. [2025-11-13 11:55:04,514][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 59 and human policies 1. 
[2025-11-13 11:55:04,514][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:55:12,987][__main__][INFO] - Number of regex retries in iteration 591: 0 [2025-11-13 11:55:12,987][__main__][INFO] - agents played in iteration 591 are Bob_buffer, Alice_buffer, Alice, Bob [2025-11-13 11:55:13,449][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:55:13,482][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:55:13,515][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:55:13,548][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:55:13,548][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:55:13,549][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:55:14,276][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:55:14,574][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:55:14,901][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:55:15,226][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:55:15,551][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:55:15,885][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:55:16,203][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:55:16,532][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:55:16,857][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:55:17,184][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:55:17,513][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:55:17,838][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:55:18,167][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:55:18,494][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:55:18,818][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:55:19,146][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:55:19,473][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:55:19,801][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:55:20,126][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:55:20,453][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:55:20,780][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:55:21,107][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:55:21,432][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:55:21,758][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:55:22,084][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:55:22,411][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:55:22,737][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:55:23,062][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:55:23,389][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:55:23,715][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:55:24,044][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:55:24,372][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:55:24,699][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:55:25,408][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:55:26,113][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:55:26,115][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:55:26,117][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:55:27,083][__main__][INFO] - Iteration 592 took 22s (37.54% Gen, 58.17% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 59m 53s. Estimated total time: 18h 48m 31s. Time estimates for 10 more iterations: 3m 45s, 100 more iterations: 37m 37s, 500 more iterations: 3h 8m 5s. [2025-11-13 11:55:27,085][__main__][INFO] - Starting iteration 592. [2025-11-13 11:55:27,089][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 59 and human policies 1. 
[2025-11-13 11:55:27,089][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:55:36,257][__main__][INFO] - Number of regex retries in iteration 592: 0
[2025-11-13 11:55:36,257][__main__][INFO] - agents played in iteration 592 are Bob_buffer, Alice_buffer, Alice, Bob
[2025-11-13 11:55:36,698][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:55:36,733][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:55:36,766][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:55:36,799][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:55:36,800][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:55:36,800][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:55:37,540][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:55:37,837][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:55:38,164][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:55:38,491][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:55:38,819][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:55:39,144][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:55:39,470][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:55:39,797][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:55:40,124][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:55:40,448][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:55:40,775][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:55:41,104][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:55:41,427][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:55:41,754][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:55:42,080][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:55:42,407][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:55:42,734][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:55:43,061][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:55:43,387][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:55:43,712][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:55:44,037][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:55:44,363][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:55:44,690][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:55:45,018][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:55:45,344][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:55:45,671][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:55:45,998][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:55:46,324][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:55:46,655][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:55:46,985][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:55:47,313][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:55:47,640][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:55:47,967][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:55:48,674][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:55:49,406][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:55:49,408][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:55:49,409][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:55:50,341][__main__][INFO] - Iteration 593 took 23s (39.42% Gen, 56.56% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 33m 38s. Estimated total time: 19h 22m 39s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 45s, 500 more iterations: 3h 13m 46s.
[2025-11-13 11:55:50,343][__main__][INFO] - Starting iteration 593.
[2025-11-13 11:55:50,346][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 59 and human policies 1.
[2025-11-13 11:55:50,347][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:55:59,227][__main__][INFO] - Number of regex retries in iteration 593: 0
[2025-11-13 11:55:59,228][__main__][INFO] - agents played in iteration 593 are Bob_buffer, Alice_buffer, Alice, Bob
[2025-11-13 11:55:59,679][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:55:59,711][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:55:59,745][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:55:59,778][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:55:59,779][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:55:59,779][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:56:00,506][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:56:00,809][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:56:01,133][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:56:01,461][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:56:01,786][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:56:02,113][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:56:02,438][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:56:02,763][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:56:03,091][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:56:03,419][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:56:03,746][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:56:04,072][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:56:04,398][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:56:04,724][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:56:05,048][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:56:05,375][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:56:05,700][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:56:06,026][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:56:06,352][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:56:06,678][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:56:07,004][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:56:07,329][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:56:07,655][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:56:07,982][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:56:08,309][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:56:08,636][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:56:08,962][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:56:09,288][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:56:09,612][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:56:09,937][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:56:10,264][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:56:10,592][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:56:10,919][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:56:11,645][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:56:12,351][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:56:12,352][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:56:12,354][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:56:13,262][__main__][INFO] - Iteration 594 took 22s (38.75% Gen, 57.28% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 16m 26s. Estimated total time: 19h 5m 50s. Time estimates for 10 more iterations: 3m 49s, 100 more iterations: 38m 11s, 500 more iterations: 3h 10m 58s.
[2025-11-13 11:56:13,264][__main__][INFO] - Starting iteration 594.
[2025-11-13 11:56:13,267][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 59 and human policies 1.
[2025-11-13 11:56:13,268][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:56:21,447][__main__][INFO] - Number of regex retries in iteration 594: 0
[2025-11-13 11:56:21,448][__main__][INFO] - agents played in iteration 594 are Bob_buffer, Alice_buffer, Alice, Bob
[2025-11-13 11:56:21,903][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:56:21,939][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:56:21,972][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:56:22,005][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:56:22,005][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:56:22,006][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:56:22,728][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:56:23,026][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:56:23,355][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:56:23,681][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:56:24,009][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:56:24,341][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:56:24,667][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:56:24,993][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:56:25,320][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:56:25,648][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:56:25,977][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:56:26,303][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:56:26,629][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:56:26,955][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:56:27,282][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:56:27,608][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:56:27,933][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:56:28,259][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:56:28,586][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:56:28,913][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:56:29,238][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:56:29,564][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:56:29,889][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:56:30,214][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:56:30,545][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:56:30,871][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:56:31,196][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:56:31,528][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:56:31,850][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:56:32,176][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:56:32,502][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:56:32,831][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:56:33,162][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:56:33,876][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:56:34,600][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:56:34,602][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:56:34,604][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:56:35,551][__main__][INFO] - Iteration 595 took 22s (36.71% Gen, 59.03% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 44m 28s. Estimated total time: 18h 34m 14s. Time estimates for 10 more iterations: 3m 42s, 100 more iterations: 37m 8s, 500 more iterations: 3h 5m 42s.
[2025-11-13 11:56:35,554][__main__][INFO] - Starting iteration 595.
[2025-11-13 11:56:35,557][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 59 and human policies 1.
[2025-11-13 11:56:35,558][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:56:44,346][__main__][INFO] - Number of regex retries in iteration 595: 0
[2025-11-13 11:56:44,347][__main__][INFO] - agents played in iteration 595 are Bob_buffer, Alice_buffer, Alice, Bob
[2025-11-13 11:56:44,792][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:56:44,824][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:56:44,857][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:56:44,891][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:56:44,891][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:56:44,892][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:56:45,609][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:56:45,906][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:56:46,233][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:56:46,558][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:56:46,883][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:56:47,208][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:56:47,535][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:56:47,864][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:56:48,190][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:56:48,521][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:56:48,841][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:56:49,167][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:56:49,492][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:56:49,819][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:56:50,146][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:56:50,474][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:56:50,799][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:56:51,127][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:56:51,454][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:56:51,780][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:56:52,104][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:56:52,430][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:56:52,756][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:56:53,082][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:56:53,407][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:56:53,736][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:56:54,061][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:56:54,388][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:56:54,718][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:56:55,045][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:56:55,371][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:56:55,697][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:56:56,020][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:56:56,740][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:56:57,454][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:56:57,456][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:56:57,458][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:56:58,419][__main__][INFO] - Iteration 596 took 22s (38.44% Gen, 57.35% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 12m 59s. Estimated total time: 19h 3m 8s. Time estimates for 10 more iterations: 3m 48s, 100 more iterations: 38m 6s, 500 more iterations: 3h 10m 31s.
[2025-11-13 11:56:58,421][__main__][INFO] - Starting iteration 596.
[2025-11-13 11:56:58,424][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 59 and human policies 1.
[2025-11-13 11:56:58,425][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:57:06,970][__main__][INFO] - Number of regex retries in iteration 596: 0
[2025-11-13 11:57:06,971][__main__][INFO] - agents played in iteration 596 are Bob_buffer, Alice_buffer, Alice, Bob
[2025-11-13 11:57:07,412][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:57:07,445][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:57:07,478][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:57:07,511][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:57:07,512][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:57:07,512][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:57:08,235][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:57:08,534][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:57:08,862][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:57:09,186][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:57:09,513][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:57:09,843][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:57:10,170][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:57:10,503][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:57:10,829][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:57:11,153][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:57:11,478][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:57:11,805][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:57:12,132][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:57:12,457][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:57:12,785][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:57:13,110][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:57:13,437][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:57:13,764][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:57:14,090][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:57:14,417][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:57:14,744][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:57:15,070][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:57:15,397][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:57:15,722][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:57:16,049][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:57:16,375][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:57:16,703][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:57:17,030][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:57:17,359][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:57:17,684][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:57:18,011][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:57:18,337][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:57:18,662][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:57:19,353][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:57:20,065][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:57:20,066][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:57:20,068][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:57:20,968][__main__][INFO] - Iteration 597 took 22s (37.91% Gen, 58.09% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 56m 43s. Estimated total time: 18h 47m 14s. Time estimates for 10 more iterations: 3m 45s, 100 more iterations: 37m 34s, 500 more iterations: 3h 7m 52s.
[2025-11-13 11:57:20,970][__main__][INFO] - Starting iteration 597.
[2025-11-13 11:57:20,973][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 59 and human policies 1.
[2025-11-13 11:57:20,974][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:57:29,523][__main__][INFO] - Number of regex retries in iteration 597: 0
[2025-11-13 11:57:29,524][__main__][INFO] - agents played in iteration 597 are Bob_buffer, Alice_buffer, Alice, Bob
[2025-11-13 11:57:29,956][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:57:29,988][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:57:30,022][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:57:30,055][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:57:30,056][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:57:30,056][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:57:30,767][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:57:31,064][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:57:31,390][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:57:31,715][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:57:32,040][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:57:32,365][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:57:32,690][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:57:33,015][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:57:33,340][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:57:33,665][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:57:33,993][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:57:34,319][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:57:34,645][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:57:34,972][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:57:35,299][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:57:35,627][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:57:35,953][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:57:36,280][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:57:36,606][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:57:36,932][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:57:37,258][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:57:37,592][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:57:37,919][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:57:38,244][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:57:38,571][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:57:38,907][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:57:39,234][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:57:39,561][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:57:39,889][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:57:40,217][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:57:40,543][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:57:40,869][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:57:41,195][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:57:41,893][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:57:42,631][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:57:42,632][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:57:42,634][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:57:43,596][__main__][INFO] - Iteration 598 took 22s (37.79% Gen, 57.95% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 0m 17s. Estimated total time: 18h 51m 10s. Time estimates for 10 more iterations: 3m 46s, 100 more iterations: 37m 42s, 500 more iterations: 3h 8m 31s.
[2025-11-13 11:57:43,598][__main__][INFO] - Starting iteration 598.
[2025-11-13 11:57:43,601][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 59 and human policies 1.
[2025-11-13 11:57:43,602][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:57:52,354][__main__][INFO] - Number of regex retries in iteration 598: 0
[2025-11-13 11:57:52,355][__main__][INFO] - agents played in iteration 598 are Bob_buffer, Alice_buffer, Alice, Bob
[2025-11-13 11:57:52,794][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:57:52,827][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:57:52,860][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:57:52,894][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:57:52,894][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:57:52,894][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:57:53,616][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:57:53,914][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:57:54,241][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:57:54,574][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:57:54,907][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:57:55,237][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:57:55,567][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:57:55,902][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:57:56,233][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:57:56,560][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:57:56,885][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:57:57,220][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:57:57,555][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:57:57,884][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:57:58,219][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:57:58,540][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:57:58,866][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:57:59,192][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:57:59,526][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:57:59,841][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:58:00,169][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:58:00,495][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:58:00,830][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:58:01,148][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:58:01,473][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:58:01,799][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:58:02,126][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:58:02,454][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:58:02,781][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:58:03,108][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:58:03,435][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:58:03,761][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:58:04,087][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:58:04,814][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:58:05,542][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:58:05,543][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:58:05,545][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:58:06,431][__main__][INFO] - Iteration 599 took 22s (38.34% Gen, 57.78% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 10m 14s. Estimated total time: 19h 1m 31s. Time estimates for 10 more iterations: 3m 48s, 100 more iterations: 38m 3s, 500 more iterations: 3h 10m 15s.
[2025-11-13 11:58:06,433][__main__][INFO] - Starting iteration 599.
[2025-11-13 11:58:06,436][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 59 and human policies 1.
[2025-11-13 11:58:06,436][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:58:15,459][__main__][INFO] - Number of regex retries in iteration 599: 0
[2025-11-13 11:58:15,460][__main__][INFO] - agents played in iteration 599 are Bob_buffer, Alice_buffer, Alice, Bob
[2025-11-13 11:58:15,892][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:58:15,925][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:58:15,958][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:58:15,991][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:58:15,991][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:58:15,992][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:58:16,715][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:58:17,010][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:58:17,341][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:58:17,667][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:58:17,995][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:58:18,321][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:58:18,651][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:58:18,976][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:58:19,302][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:58:19,628][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:58:19,958][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:58:20,288][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:58:20,613][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:58:20,945][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:58:21,273][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:58:21,600][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:58:21,928][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:58:22,256][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:58:22,581][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:58:22,907][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:58:23,234][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:58:23,560][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:58:23,886][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:58:24,211][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:58:24,536][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:58:24,861][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:58:25,186][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:58:25,511][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:58:25,836][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:58:26,163][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:58:26,489][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:58:26,815][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:58:27,142][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:58:27,852][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:58:28,594][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:58:28,596][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:58:28,598][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:58:29,467][__main__][INFO] - Iteration 600 took 23s (39.18% Gen, 57.04% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 19m 57s. Estimated total time: 19h 11m 36s. Time estimates for 10 more iterations: 3m 50s, 100 more iterations: 38m 23s, 500 more iterations: 3h 11m 56s.
[2025-11-13 11:58:29,469][__main__][INFO] - Starting iteration 600.
[2025-11-13 11:58:29,473][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 59 and human policies 1.
[2025-11-13 11:58:29,473][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:58:38,126][__main__][INFO] - Number of regex retries in iteration 600: 0
[2025-11-13 11:58:38,127][__main__][INFO] - agents played in iteration 600 are Bob_buffer, Alice_buffer, Alice, Bob
[2025-11-13 11:58:38,559][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:58:38,592][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:58:38,635][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:58:38,673][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:58:38,673][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:58:38,673][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:58:39,379][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:58:39,676][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:58:40,003][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:58:40,330][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:58:40,658][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:58:40,985][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:58:41,313][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:58:41,639][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:58:41,965][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:58:42,291][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:58:42,620][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:58:42,946][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:58:43,273][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:58:43,599][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:58:43,929][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:58:44,255][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:58:44,582][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:58:44,909][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:58:45,243][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:58:45,569][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:58:45,896][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:58:46,222][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:58:46,549][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:58:46,877][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:58:47,204][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:58:47,531][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:58:47,859][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:58:48,185][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:58:48,510][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:58:48,835][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:58:49,163][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:58:49,489][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:58:49,814][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:58:50,541][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:58:51,248][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:58:51,249][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:58:51,251][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:58:53,273][__main__][INFO] - Iteration 601 took 23s (36.36% Gen, 55.14% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 57m 59s. Estimated total time: 19h 50m 3s. Time estimates for 10 more iterations: 3m 58s, 100 more iterations: 39m 40s, 500 more iterations: 3h 18m 20s.
[2025-11-13 11:58:53,275][__main__][INFO] - Starting iteration 601.
[2025-11-13 11:58:53,279][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 60 and human policies 1.
[2025-11-13 11:58:53,279][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:59:01,860][__main__][INFO] - Number of regex retries in iteration 601: 0
[2025-11-13 11:59:01,860][__main__][INFO] - agents played in iteration 601 are Bob_buffer, Alice_buffer, Alice, Bob
[2025-11-13 11:59:02,321][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:59:02,354][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:59:02,387][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:59:02,420][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:59:02,420][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:59:02,421][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:59:03,122][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:59:03,420][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:59:03,747][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:59:04,076][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:59:04,404][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:59:04,734][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:59:05,060][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:59:05,385][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:59:05,712][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:59:06,037][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:59:06,369][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:59:06,700][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:59:07,026][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:59:07,353][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:59:07,679][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:59:08,006][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:59:08,334][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:59:08,667][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:59:08,994][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:59:09,321][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:59:09,647][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:59:09,973][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:59:10,299][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:59:10,625][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:59:10,950][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:59:11,276][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:59:11,601][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:59:11,927][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:59:12,253][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:59:12,577][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:59:12,902][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:59:13,229][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:59:13,555][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:59:14,281][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:59:14,997][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:59:14,998][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:59:15,000][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:59:15,892][__main__][INFO] - Iteration 602 took 22s (37.95% Gen, 58.10% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 58m 16s. Estimated total time: 18h 50m 42s. Time estimates for 10 more iterations: 3m 46s, 100 more iterations: 37m 41s, 500 more iterations: 3h 8m 27s.
[2025-11-13 11:59:15,894][__main__][INFO] - Starting iteration 602.
[2025-11-13 11:59:15,897][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 60 and human policies 1.
[2025-11-13 11:59:15,898][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:59:24,559][__main__][INFO] - Number of regex retries in iteration 602: 0
[2025-11-13 11:59:24,560][__main__][INFO] - agents played in iteration 602 are Bob_buffer, Alice_buffer, Alice, Bob
[2025-11-13 11:59:25,000][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:59:25,033][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:59:25,066][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:59:25,100][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:59:25,100][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:59:25,100][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:59:25,832][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:59:26,131][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:59:26,459][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:59:26,786][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:59:27,113][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:59:27,440][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:59:27,766][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:59:28,090][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:59:28,414][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:59:28,743][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:59:29,067][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:59:29,392][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:59:29,721][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:59:30,047][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:59:30,372][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:59:30,698][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:59:31,028][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:59:31,359][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:59:31,687][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:59:32,017][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:59:32,345][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:59:32,670][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:59:32,998][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:59:33,323][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:59:33,649][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:59:33,975][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:59:34,301][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:59:34,635][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:59:34,954][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:59:35,278][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:59:35,604][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:59:35,938][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:59:36,260][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:59:36,974][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:59:37,699][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:59:37,701][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:59:37,702][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:59:38,634][__main__][INFO] - Iteration 603 took 22s (38.09% Gen, 57.80% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 4m 5s. Estimated total time: 18h 56m 53s. Time estimates for 10 more iterations: 3m 47s, 100 more iterations: 37m 53s, 500 more iterations: 3h 9m 28s.
[2025-11-13 11:59:38,637][__main__][INFO] - Starting iteration 603.
[2025-11-13 11:59:38,640][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 60 and human policies 1.
[2025-11-13 11:59:38,640][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:59:48,090][__main__][INFO] - Number of regex retries in iteration 603: 0
[2025-11-13 11:59:48,091][__main__][INFO] - agents played in iteration 603 are Bob_buffer, Alice_buffer, Alice, Bob
[2025-11-13 11:59:48,523][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:59:48,556][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:59:48,590][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:59:48,623][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:59:48,624][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:59:48,624][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:59:49,362][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:59:49,659][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:59:49,987][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:59:50,312][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:59:50,637][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:59:50,963][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:59:51,288][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:59:51,617][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:59:51,942][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:59:52,268][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:59:52,595][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:59:52,920][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:59:53,248][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:59:53,575][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:59:53,903][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:59:54,242][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:59:54,569][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:59:54,897][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:59:55,223][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:59:55,552][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:59:55,880][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:59:56,206][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:59:56,531][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:59:56,857][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:59:57,182][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:59:57,507][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:59:57,838][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:59:58,159][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:59:58,485][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:59:58,812][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:59:59,144][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:59:59,467][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:59:59,793][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 12:00:00,511][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 12:00:01,245][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 12:00:01,247][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 12:00:01,248][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 12:00:02,560][__main__][INFO] - Iteration 604 took 23s (39.51% Gen, 55.00% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 2m 52s. Estimated total time: 19h 56m 5s. Time estimates for 10 more iterations: 3m 59s, 100 more iterations: 39m 52s, 500 more iterations: 3h 19m 20s.
[2025-11-13 12:00:02,562][__main__][INFO] - Starting iteration 604.
[2025-11-13 12:00:02,566][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 60 and human policies 1.
[2025-11-13 12:00:02,567][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 12:00:11,597][__main__][INFO] - Number of regex retries in iteration 604: 0
[2025-11-13 12:00:11,598][__main__][INFO] - agents played in iteration 604 are Bob_buffer, Alice_buffer, Alice, Bob
[2025-11-13 12:00:12,031][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:00:12,064][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:00:12,097][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:00:12,130][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:00:12,131][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 12:00:12,131][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 12:00:12,850][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 12:00:13,147][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 12:00:13,475][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 12:00:13,800][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 12:00:14,130][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 12:00:14,461][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 12:00:14,789][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 12:00:15,114][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 12:00:15,440][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 12:00:15,766][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 12:00:16,093][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 12:00:16,419][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 12:00:16,746][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 12:00:17,076][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 12:00:17,405][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 12:00:17,733][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 12:00:18,063][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 12:00:18,391][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 12:00:18,716][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 12:00:19,042][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 12:00:19,368][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 12:00:19,697][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 12:00:20,022][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 12:00:20,357][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 12:00:20,675][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 12:00:21,002][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 12:00:21,328][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 12:00:21,661][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 12:00:21,980][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 12:00:22,305][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 12:00:22,630][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 12:00:22,956][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 12:00:23,282][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 12:00:23,992][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 12:00:24,693][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 12:00:24,695][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 12:00:24,696][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 12:00:25,681][__main__][INFO] - Iteration 605 took 23s (39.07% Gen, 56.66% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 22m 12s. Estimated total time: 19h 15m 48s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 31s, 500 more iterations: 3h 12m 38s.
[2025-11-13 12:00:25,683][__main__][INFO] - Starting iteration 605.
[2025-11-13 12:00:25,686][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 60 and human policies 1.
[2025-11-13 12:00:25,686][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 12:00:34,335][__main__][INFO] - Number of regex retries in iteration 605: 0
[2025-11-13 12:00:34,335][__main__][INFO] - agents played in iteration 605 are Bob_buffer, Alice_buffer, Alice, Bob
[2025-11-13 12:00:34,782][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:00:34,816][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:00:34,848][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:00:34,881][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:00:34,883][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 12:00:34,883][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 12:00:35,603][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 12:00:35,900][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 12:00:36,238][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 12:00:36,565][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 12:00:36,896][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 12:00:37,228][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 12:00:37,556][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 12:00:37,883][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 12:00:38,210][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 12:00:38,538][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 12:00:38,870][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 12:00:39,196][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 12:00:39,523][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 12:00:39,858][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 12:00:40,176][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 12:00:40,504][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 12:00:40,835][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 12:00:41,165][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 12:00:41,493][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 12:00:41,818][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 12:00:42,142][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 12:00:42,474][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 12:00:42,798][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 12:00:43,127][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 12:00:43,458][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 12:00:43,783][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 12:00:44,109][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 12:00:44,436][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 12:00:44,761][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 12:00:45,090][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 12:00:45,417][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 12:00:45,742][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 12:00:46,069][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 12:00:46,912][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 12:00:47,596][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 12:00:47,598][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 12:00:47,599][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 12:00:48,725][__main__][INFO] - Iteration 606 took 23s (37.54% Gen, 57.57% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 18m 0s. Estimated total time: 19h 11m 59s. Time estimates for 10 more iterations: 3m 50s, 100 more iterations: 38m 23s, 500 more iterations: 3h 11m 59s.
[2025-11-13 12:00:48,727][__main__][INFO] - Starting iteration 606.
[2025-11-13 12:00:48,729][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 60 and human policies 1.
[2025-11-13 12:00:48,730][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 12:00:57,420][__main__][INFO] - Number of regex retries in iteration 606: 0
[2025-11-13 12:00:57,420][__main__][INFO] - agents played in iteration 606 are Bob_buffer, Alice_buffer, Alice, Bob
[2025-11-13 12:00:57,856][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:00:57,889][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:00:57,922][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:00:57,955][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:00:57,956][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 12:00:57,956][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 12:00:58,673][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 12:00:58,969][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 12:00:59,296][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 12:00:59,622][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 12:00:59,950][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 12:01:00,277][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 12:01:00,603][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 12:01:00,936][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 12:01:01,260][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 12:01:01,584][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 12:01:01,913][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 12:01:02,243][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 12:01:02,571][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 12:01:02,899][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 12:01:03,229][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 12:01:03,562][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 12:01:03,884][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 12:01:04,214][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 12:01:04,541][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 12:01:04,875][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 12:01:05,198][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 12:01:05,527][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 12:01:05,855][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 12:01:06,183][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 12:01:06,510][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 12:01:06,838][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 12:01:07,166][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 12:01:07,493][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 12:01:07,822][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 12:01:08,148][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 12:01:08,476][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 12:01:08,802][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 12:01:09,137][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 12:01:09,858][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 12:01:10,562][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 12:01:10,566][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 12:01:10,567][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 12:01:11,525][__main__][INFO] - Iteration 607 took 22s (38.12% Gen, 57.67% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 5m 27s. Estimated total time: 18h 59m 49s. Time estimates for 10 more iterations: 3m 47s, 100 more iterations: 37m 59s, 500 more iterations: 3h 9m 58s.
[2025-11-13 12:01:11,527][__main__][INFO] - Starting iteration 607.
[2025-11-13 12:01:11,529][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 60 and human policies 1.
[2025-11-13 12:01:11,530][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 12:01:20,766][__main__][INFO] - Number of regex retries in iteration 607: 0
[2025-11-13 12:01:20,767][__main__][INFO] - agents played in iteration 607 are Bob_buffer, Alice_buffer, Alice, Bob
[2025-11-13 12:01:21,207][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:01:21,240][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:01:21,273][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:01:21,306][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:01:21,307][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 12:01:21,307][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 12:01:22,025][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 12:01:22,326][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 12:01:22,654][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 12:01:22,984][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 12:01:23,309][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 12:01:23,645][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 12:01:23,975][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 12:01:24,301][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 12:01:24,626][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 12:01:24,959][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 12:01:25,286][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 12:01:25,613][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 12:01:25,941][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 12:01:26,267][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 12:01:26,599][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 12:01:26,930][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 12:01:27,257][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 12:01:27,583][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 12:01:27,909][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 12:01:28,235][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 12:01:28,563][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 12:01:28,890][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 12:01:29,216][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 12:01:29,543][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 12:01:29,868][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 12:01:30,194][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 12:01:30,520][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 12:01:30,846][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 12:01:31,173][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 12:01:31,503][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 12:01:31,830][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 12:01:32,156][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 12:01:32,482][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 12:01:33,201][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 12:01:33,877][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 12:01:33,879][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 12:01:33,880][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 12:01:35,337][__main__][INFO] - Iteration 608 took 23s (38.80% Gen, 55.08% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 55m 41s. Estimated total time: 19h 50m 26s. Time estimates for 10 more iterations: 3m 58s, 100 more iterations: 39m 40s, 500 more iterations: 3h 18m 24s.
[2025-11-13 12:01:35,339][__main__][INFO] - Starting iteration 608.
[2025-11-13 12:01:35,342][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 60 and human policies 1.
[2025-11-13 12:01:35,343][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 12:01:44,518][__main__][INFO] - Number of regex retries in iteration 608: 0
[2025-11-13 12:01:44,518][__main__][INFO] - agents played in iteration 608 are Bob_buffer, Alice_buffer, Alice, Bob
[2025-11-13 12:01:44,981][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:01:45,015][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:01:45,047][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:01:45,080][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:01:45,081][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 12:01:45,081][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 12:01:45,789][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 12:01:46,090][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 12:01:46,415][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 12:01:46,742][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 12:01:47,068][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 12:01:47,393][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 12:01:47,719][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 12:01:48,045][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 12:01:48,371][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 12:01:48,696][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 12:01:49,022][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 12:01:49,347][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 12:01:49,672][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 12:01:49,997][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 12:01:50,321][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 12:01:50,648][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 12:01:50,974][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 12:01:51,301][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 12:01:51,633][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 12:01:51,962][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 12:01:52,289][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 12:01:52,617][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 12:01:52,942][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 12:01:53,269][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 12:01:53,594][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 12:01:53,920][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 12:01:54,256][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 12:01:54,583][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 12:01:54,911][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 12:01:55,238][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 12:01:55,570][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 12:01:55,899][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 12:01:56,225][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 12:01:56,954][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 12:01:57,644][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 12:01:57,646][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 12:01:57,648][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 12:01:58,496][__main__][INFO] - Iteration 609 took 23s (39.63% Gen, 56.70% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 22m 36s. Estimated total time: 19h 17m 45s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 35s, 500 more iterations: 3h 12m 57s.
[2025-11-13 12:01:58,498][__main__][INFO] - Starting iteration 609.
[2025-11-13 12:01:58,501][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 60 and human policies 1.
[2025-11-13 12:01:58,501][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 12:02:07,947][__main__][INFO] - Number of regex retries in iteration 609: 0
[2025-11-13 12:02:07,948][__main__][INFO] - agents played in iteration 609 are Bob_buffer, Alice_buffer, Alice, Bob
[2025-11-13 12:02:08,406][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:02:08,439][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:02:08,472][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:02:08,505][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:02:08,506][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 12:02:08,507][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 12:02:09,214][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 12:02:09,512][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 12:02:09,838][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 12:02:10,163][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 12:02:10,489][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 12:02:10,814][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 12:02:11,142][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 12:02:11,472][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 12:02:11,798][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 12:02:12,127][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 12:02:12,453][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 12:02:12,783][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 12:02:13,107][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 12:02:13,433][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 12:02:13,758][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 12:02:14,084][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 12:02:14,410][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 12:02:14,736][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 12:02:15,073][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 12:02:15,401][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 12:02:15,729][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 12:02:16,056][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 12:02:16,385][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 12:02:16,713][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 12:02:17,038][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 12:02:17,370][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 12:02:17,699][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 12:02:18,027][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 12:02:18,353][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 12:02:18,681][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 12:02:19,006][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 12:02:19,331][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 12:02:19,658][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 12:02:20,386][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 12:02:21,076][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 12:02:21,078][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 12:02:21,079][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 12:02:22,374][__main__][INFO] - Iteration 610 took 23s (39.57% Gen, 55.00% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 58m 10s. Estimated total time: 19h 53m 43s. Time estimates for 10 more iterations: 3m 58s, 100 more iterations: 39m 47s, 500 more iterations: 3h 18m 57s.
[2025-11-13 12:02:22,376][__main__][INFO] - Starting iteration 610.
[2025-11-13 12:02:22,379][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 60 and human policies 1.
[2025-11-13 12:02:22,379][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 12:02:32,149][__main__][INFO] - Number of regex retries in iteration 610: 0
[2025-11-13 12:02:32,150][__main__][INFO] - agents played in iteration 610 are Bob_buffer, Alice_buffer, Alice, Bob
[2025-11-13 12:02:32,580][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:02:32,613][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:02:32,646][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:02:32,680][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:02:32,680][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 12:02:32,681][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 12:02:33,424][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 12:02:33,721][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 12:02:34,047][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 12:02:34,374][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 12:02:34,704][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 12:02:35,026][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 12:02:35,355][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 12:02:35,683][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 12:02:36,009][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 12:02:36,334][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 12:02:36,660][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 12:02:36,987][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 12:02:37,312][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 12:02:37,641][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 12:02:37,966][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 12:02:38,294][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 12:02:38,619][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 12:02:38,947][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 12:02:39,273][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 12:02:39,600][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 12:02:39,926][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 12:02:40,260][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 12:02:40,585][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 12:02:40,915][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 12:02:41,243][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 12:02:41,576][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 12:02:41,902][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 12:02:42,234][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 12:02:42,563][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 12:02:42,896][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 12:02:43,221][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 12:02:43,548][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 12:02:43,882][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 12:02:44,586][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 12:02:45,310][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 12:02:45,312][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 12:02:45,313][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 12:02:47,459][__main__][INFO] - Iteration 611 took 25s (38.95% Gen, 52.48% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 58m 4s. Estimated total time: 20h 54m 2s. Time estimates for 10 more iterations: 4m 10s, 100 more iterations: 41m 48s, 500 more iterations: 3h 29m 0s.
[2025-11-13 12:02:47,461][__main__][INFO] - Starting iteration 611.
[2025-11-13 12:02:47,463][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 61 and human policies 1.
[2025-11-13 12:02:47,464][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 12:02:56,493][__main__][INFO] - Number of regex retries in iteration 611: 0
[2025-11-13 12:02:56,493][__main__][INFO] - agents played in iteration 611 are Bob_buffer, Alice_buffer, Alice, Bob
[2025-11-13 12:02:56,932][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:02:56,964][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:02:57,012][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:02:57,050][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:02:57,050][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 12:02:57,051][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 12:02:57,777][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 12:02:58,075][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 12:02:58,402][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 12:02:58,730][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 12:02:59,058][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 12:02:59,385][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 12:02:59,718][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 12:03:00,045][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 12:03:00,371][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 12:03:00,699][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 12:03:01,024][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 12:03:01,349][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 12:03:01,674][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 12:03:02,001][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 12:03:02,329][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 12:03:02,656][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 12:03:02,984][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 12:03:03,314][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 12:03:03,646][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 12:03:03,976][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 12:03:04,303][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 12:03:04,633][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 12:03:04,962][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 12:03:05,289][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 12:03:05,616][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 12:03:05,942][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 12:03:06,272][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 12:03:06,599][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 12:03:06,926][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 12:03:07,251][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 12:03:07,577][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 12:03:07,904][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 12:03:08,231][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 12:03:08,963][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 12:03:09,674][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 12:03:09,684][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 12:03:09,686][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 12:03:10,532][__main__][INFO] - Iteration 612 took 23s (39.14% Gen, 57.19% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 17m 6s. Estimated total time: 19h 13m 27s. Time estimates for 10 more iterations: 3m 50s, 100 more iterations: 38m 26s, 500 more iterations: 3h 12m 14s.
[2025-11-13 12:03:10,534][__main__][INFO] - Starting iteration 612.
[2025-11-13 12:03:10,536][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 61 and human policies 1.
[2025-11-13 12:03:10,537][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 12:03:19,945][__main__][INFO] - Number of regex retries in iteration 612: 0
[2025-11-13 12:03:19,945][__main__][INFO] - agents played in iteration 612 are Bob_buffer, Alice_buffer, Alice, Bob
[2025-11-13 12:03:20,392][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:03:20,425][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:03:20,458][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:03:20,491][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:03:20,492][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 12:03:20,492][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 12:03:21,218][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 12:03:21,513][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 12:03:21,840][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 12:03:22,171][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 12:03:22,497][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 12:03:22,825][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 12:03:23,151][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 12:03:23,479][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 12:03:23,806][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 12:03:24,132][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 12:03:24,459][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 12:03:24,786][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 12:03:25,116][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 12:03:25,443][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 12:03:25,775][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 12:03:26,108][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 12:03:26,438][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 12:03:26,771][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 12:03:27,098][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 12:03:27,425][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 12:03:27,752][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 12:03:28,077][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 12:03:28,406][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 12:03:28,734][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 12:03:29,059][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 12:03:29,384][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 12:03:29,709][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 12:03:30,034][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 12:03:30,362][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 12:03:30,688][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 12:03:31,016][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 12:03:31,344][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 12:03:31,672][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 12:03:32,415][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 12:03:33,093][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 12:03:33,094][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 12:03:33,096][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 12:03:34,206][__main__][INFO] - Iteration 613 took 23s (39.74% Gen, 55.56% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 46m 47s. Estimated total time: 19h 43m 31s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 27s, 500 more iterations: 3h 17m 15s.
[2025-11-13 12:03:34,209][__main__][INFO] - Starting iteration 613.
[2025-11-13 12:03:34,212][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 61 and human policies 1.
[2025-11-13 12:03:34,213][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 12:03:43,301][__main__][INFO] - Number of regex retries in iteration 613: 0
[2025-11-13 12:03:43,301][__main__][INFO] - agents played in iteration 613 are Bob_buffer, Alice_buffer, Alice, Bob
[2025-11-13 12:03:43,735][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:03:43,768][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:03:43,801][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:03:43,834][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:03:43,835][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 12:03:43,835][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 12:03:44,539][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 12:03:44,834][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 12:03:45,161][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 12:03:45,487][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 12:03:45,813][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 12:03:46,143][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 12:03:46,470][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 12:03:46,803][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 12:03:47,122][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 12:03:47,448][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 12:03:47,773][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 12:03:48,099][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 12:03:48,424][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 12:03:48,749][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 12:03:49,076][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 12:03:49,409][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 12:03:49,730][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 12:03:50,055][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 12:03:50,384][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 12:03:50,716][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 12:03:51,038][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 12:03:51,366][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 12:03:51,689][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 12:03:52,016][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 12:03:52,344][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 12:03:52,671][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 12:03:52,998][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 12:03:53,325][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 12:03:53,650][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 12:03:53,975][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 12:03:54,302][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 12:03:54,628][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 12:03:54,955][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 12:03:55,673][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 12:03:56,391][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 12:03:56,395][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 12:03:56,397][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 12:03:57,219][__main__][INFO] - Iteration 614 took 23s (39.50% Gen, 56.92% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 13m 16s. Estimated total time: 19h 10m 24s. Time estimates for 10 more iterations: 3m 50s, 100 more iterations: 38m 20s, 500 more iterations: 3h 11m 44s.
[2025-11-13 12:03:57,221][__main__][INFO] - Starting iteration 614.
[2025-11-13 12:03:57,224][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 61 and human policies 1.
[2025-11-13 12:03:57,224][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 12:04:06,266][__main__][INFO] - Number of regex retries in iteration 614: 0
[2025-11-13 12:04:06,267][__main__][INFO] - agents played in iteration 614 are Bob_buffer, Alice_buffer, Alice, Bob
[2025-11-13 12:04:06,720][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:04:06,753][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:04:06,786][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:04:06,819][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:04:06,819][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 12:04:06,820][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 12:04:07,839][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 12:04:08,136][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 12:04:08,462][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 12:04:08,789][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 12:04:09,121][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 12:04:09,449][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 12:04:09,773][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 12:04:10,098][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 12:04:10,424][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 12:04:10,751][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 12:04:11,079][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 12:04:11,409][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 12:04:11,740][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 12:04:12,072][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 12:04:12,396][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 12:04:12,720][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 12:04:13,046][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 12:04:13,382][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 12:04:13,711][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 12:04:14,036][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 12:04:14,361][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 12:04:14,693][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 12:04:15,018][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 12:04:15,345][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 12:04:15,670][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 12:04:16,001][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 12:04:16,328][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 12:04:16,655][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 12:04:16,987][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 12:04:17,309][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 12:04:17,638][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 12:04:17,963][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 12:04:18,300][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 12:04:19,020][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 12:04:19,730][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 12:04:19,732][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 12:04:19,733][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 12:04:20,599][__main__][INFO] - Iteration 615 took 23s (38.68% Gen, 57.61% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 31m 16s. Estimated total time: 19h 28m 46s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 57s, 500 more iterations: 3h 14m 47s.
[2025-11-13 12:04:20,601][__main__][INFO] - Starting iteration 615.
[2025-11-13 12:04:20,604][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 61 and human policies 1.
[2025-11-13 12:04:20,605][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 12:04:29,730][__main__][INFO] - Number of regex retries in iteration 615: 0
[2025-11-13 12:04:29,731][__main__][INFO] - agents played in iteration 615 are Bob_buffer, Alice_buffer, Alice, Bob
[2025-11-13 12:04:30,168][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:04:30,529][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:04:30,561][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:04:30,594][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:04:30,595][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 12:04:30,595][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 12:04:31,302][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 12:04:31,600][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 12:04:31,927][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 12:04:32,252][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 12:04:32,578][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 12:04:32,906][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 12:04:33,232][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 12:04:33,559][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 12:04:33,888][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 12:04:34,217][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 12:04:34,545][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 12:04:34,873][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 12:04:35,204][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 12:04:35,533][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 12:04:35,863][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 12:04:36,190][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 12:04:36,521][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 12:04:36,850][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 12:04:37,178][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 12:04:37,506][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 12:04:37,833][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 12:04:38,160][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 12:04:38,488][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 12:04:38,817][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 12:04:39,144][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 12:04:39,473][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 12:04:39,800][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 12:04:40,125][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 12:04:40,451][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 12:04:40,777][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 12:04:41,104][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 12:04:41,431][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 12:04:41,757][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 12:04:42,461][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 12:04:43,153][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 12:04:43,156][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 12:04:43,157][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 12:04:43,981][__main__][INFO] - Iteration 616 took 23s (39.03% Gen, 57.43% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 31m 1s. Estimated total time: 19h 28m 55s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 57s, 500 more iterations: 3h 14m 49s.
[2025-11-13 12:04:43,983][__main__][INFO] - Starting iteration 616.
[2025-11-13 12:04:43,986][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 61 and human policies 1.
[2025-11-13 12:04:43,986][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 12:04:53,575][__main__][INFO] - Number of regex retries in iteration 616: 0
[2025-11-13 12:04:53,575][__main__][INFO] - agents played in iteration 616 are Bob_buffer, Alice_buffer, Alice, Bob
[2025-11-13 12:04:54,010][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:04:54,047][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:04:54,082][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:04:54,115][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:04:54,115][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 12:04:54,116][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 12:04:54,810][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 12:04:55,107][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 12:04:55,433][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 12:04:55,760][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 12:04:56,088][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 12:04:56,418][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 12:04:56,749][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 12:04:57,075][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 12:04:57,402][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 12:04:57,730][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 12:04:58,056][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 12:04:58,381][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 12:04:58,707][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 12:04:59,033][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 12:04:59,365][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 12:04:59,692][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 12:05:00,018][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 12:05:00,344][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 12:05:00,670][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 12:05:00,997][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 12:05:01,325][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 12:05:01,650][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 12:05:01,976][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 12:05:02,303][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 12:05:02,631][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 12:05:02,958][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 12:05:03,298][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 12:05:03,624][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 12:05:03,952][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 12:05:04,279][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 12:05:04,607][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 12:05:04,933][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 12:05:05,261][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 12:05:05,973][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 12:05:06,655][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/policy_optimizer_state.pt
[2025-11-13 12:05:06,657][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/critic_optimizer_state.pt
[2025-11-13 12:05:06,659][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed4321_bs128/seed_4321/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 12:05:07,870][__main__][INFO] - Iteration 617 took 23s (40.15% Gen, 54.78% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 55m 57s. Estimated total time: 19h 54m 15s. Time estimates for 10 more iterations: 3m 58s, 100 more iterations: 39m 48s, 500 more iterations: 3h 19m 2s.
[2025-11-13 12:05:07,876][__main__][INFO] - Starting iteration 617.
[2025-11-13 12:05:11,022][asyncio][WARNING] - socket.send() raised exception.
[2025-11-13 12:05:11,022][asyncio][WARNING] - socket.send() raised exception.
[2025-11-13 12:05:11,023][asyncio][WARNING] - socket.send() raised exception.
[2025-11-13 12:05:11,023][asyncio][WARNING] - socket.send() raised exception.
[2025-11-13 12:05:11,023][asyncio][WARNING] - socket.send() raised exception.
[2025-11-13 12:05:11,023][asyncio][WARNING] - socket.send() raised exception.
[2025-11-13 12:05:11,024][asyncio][WARNING] - socket.send() raised exception.
[2025-11-13 12:05:11,024][asyncio][WARNING] - socket.send() raised exception.
[2025-11-13 12:05:11,024][asyncio][WARNING] - socket.send() raised exception.
[2025-11-13 12:05:11,069][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,069][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,069][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,069][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,069][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,070][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,105][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,106][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,106][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,106][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,106][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,107][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,107][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,107][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,107][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,107][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,108][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,108][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,108][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,108][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,109][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,109][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,109][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,109][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,110][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:11,110][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,110][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,110][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,110][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,111][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,111][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,111][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,111][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,111][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,112][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,112][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,112][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,112][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,112][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,113][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,113][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,113][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,113][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,113][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,114][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,114][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,114][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,114][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,114][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,114][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:11,115][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,115][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,115][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,115][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,116][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,116][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,116][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,116][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,116][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,117][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,117][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,117][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,117][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,117][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,117][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,118][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,118][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,118][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,118][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,118][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,118][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,119][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,119][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,119][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,119][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:11,119][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,119][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,120][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,120][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,120][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,120][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,120][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,120][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,121][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,121][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,121][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,121][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,121][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,121][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,121][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,122][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,122][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,122][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,122][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,122][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,122][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,122][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,123][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,123][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,123][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:11,123][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,123][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,123][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,123][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,124][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,124][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,124][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,124][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,124][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,124][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,124][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,125][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,125][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,125][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,125][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,125][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,125][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,125][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,125][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,126][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,126][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,126][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,126][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,126][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,126][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:11,126][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,127][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,127][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,127][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,127][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,127][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,127][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,127][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,127][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,128][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,128][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,128][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,128][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,128][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,128][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,128][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,129][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,129][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,129][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,129][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,129][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,129][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,129][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,129][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,130][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:11,130][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,130][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,130][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,130][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,130][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,130][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,131][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,131][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,131][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,131][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,131][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,131][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,131][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,131][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,132][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,132][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,132][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,132][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,132][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,132][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,132][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,133][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,133][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,133][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,133][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:11,133][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,133][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,133][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,133][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,134][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,134][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,134][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,134][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,134][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,134][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,134][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,135][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,135][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,135][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,135][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,135][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,135][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,135][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,135][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,136][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,136][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,136][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,136][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,136][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,136][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:11,136][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,136][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,137][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,137][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,137][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,137][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,137][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,137][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,137][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,137][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,138][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,138][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,138][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,138][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,138][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,138][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,138][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,139][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,139][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,139][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,139][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,139][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,139][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,139][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,139][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:11,140][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,140][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,140][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,140][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,140][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,140][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,140][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,140][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,141][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,141][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,141][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,141][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,141][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,141][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,141][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,141][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,142][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,142][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,142][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,142][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,142][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,142][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,142][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,142][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,143][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:11,143][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,143][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,143][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,143][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,143][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,143][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,143][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,144][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,144][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,144][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,144][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,144][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,144][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,144][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,145][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,145][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,145][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,145][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,145][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,145][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,145][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,145][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,146][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,146][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,146][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:11,146][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,146][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,146][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,146][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,146][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,147][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,147][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,147][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,147][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,147][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,147][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,147][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,148][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,148][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,148][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,148][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,148][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,148][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,148][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,148][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,149][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,149][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,149][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,149][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,149][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:11,149][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,149][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,150][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,150][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,150][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,150][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,150][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,150][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,150][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,150][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,151][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,151][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,151][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,151][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,151][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,151][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,151][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,152][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,152][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,152][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,152][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,152][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,152][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,152][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,153][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:11,153][asyncio][WARNING] - socket.send() raised exception.
[2025-11-13 12:05:11,296][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,296][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,296][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,296][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,296][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,297][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,297][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,297][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,297][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,297][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,297][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,297][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,298][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,298][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,298][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,298][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,298][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,298][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,298][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,298][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,299][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,299][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,299][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,299][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,299][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:11,299][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,299][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,300][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,300][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,300][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,300][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,300][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,300][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,300][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,300][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,301][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,301][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,301][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,301][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,301][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,301][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,301][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,302][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,302][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,302][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,302][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,302][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,302][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,302][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,303][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:11,303][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,303][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,303][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,303][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,303][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,303][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,303][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,304][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,304][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,304][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,304][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,304][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,304][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,304][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,305][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,305][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,305][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,305][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,305][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,305][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,305][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,306][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,306][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,306][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,306][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:11,306][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,306][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,306][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,306][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,307][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,307][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,307][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,307][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,307][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,307][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,307][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,308][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,308][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,308][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,308][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,308][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,308][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,308][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,308][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,309][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,309][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,309][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,309][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,309][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,309][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:11,309][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,310][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,310][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,310][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,310][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,310][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,310][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,310][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,310][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,311][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,311][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,311][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,311][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,311][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,311][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,311][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,312][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,312][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,312][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,312][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,312][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,312][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,312][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,312][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,313][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:11,313][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,313][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,313][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,313][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,313][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,313][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,314][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,314][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,314][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,314][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,314][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,314][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,314][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,314][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,315][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,315][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,315][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,315][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,315][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,315][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,315][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,316][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,316][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,316][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,316][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:11,316][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,316][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,316][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,316][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,317][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,317][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,317][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,317][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,317][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,317][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,317][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,318][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,318][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,318][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,318][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,318][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,318][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,318][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,318][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,319][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,319][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,319][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,319][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,319][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,319][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:11,319][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,320][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,320][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,320][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,320][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,320][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,320][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,320][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,320][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,321][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,321][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,321][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,321][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,321][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,321][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,321][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,322][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,322][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,322][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,322][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,322][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,322][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,322][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,322][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,323][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:11,323][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,323][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,323][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,323][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,323][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,323][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,324][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,324][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,324][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,324][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,324][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,324][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,324][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,324][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,325][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,325][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,325][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,325][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,325][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,325][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,325][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,326][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,326][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,326][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,326][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:11,326][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,326][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,326][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,326][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,327][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,327][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,327][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,327][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,327][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,327][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,327][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,328][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,328][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,328][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,328][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,328][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,328][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,328][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,328][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,329][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,329][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,329][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,329][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,329][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,329][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:11,329][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,330][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,330][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,330][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,330][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,330][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,330][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,330][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,330][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,331][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,331][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,331][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,331][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,331][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,331][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,331][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,332][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,332][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,332][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,332][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,332][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,332][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,332][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,332][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,333][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:11,333][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,333][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,333][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,333][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,333][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,333][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,334][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,334][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,334][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,334][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,334][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,334][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,334][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,335][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,335][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,335][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,335][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,335][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,335][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,335][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,336][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,336][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,336][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,336][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,336][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:11,336][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,336][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,337][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,337][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,337][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,337][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,337][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,337][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,337][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,338][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,338][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,338][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,338][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,338][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,338][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,338][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,339][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,339][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,339][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,339][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,339][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,339][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,339][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,339][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,340][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:11,340][asyncio][WARNING] - socket.send() raised exception.
[2025-11-13 12:05:11,410][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,410][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,410][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,410][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,411][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,411][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,411][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,411][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,411][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,411][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,411][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,412][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,412][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,412][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,412][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,412][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,412][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,412][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,412][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,413][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,413][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,413][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,413][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,413][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,413][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:11,413][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,414][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,414][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,414][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,414][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,414][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,414][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,414][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,414][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,415][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,415][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,415][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,415][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,415][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,415][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,415][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,416][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,416][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,416][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,416][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,416][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,416][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,416][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,416][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,417][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:11,417][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,417][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,417][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,417][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,417][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,417][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,418][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,418][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,418][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,418][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,418][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,418][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,418][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,418][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,419][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,419][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,419][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,419][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,419][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,419][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,419][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,419][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,420][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,420][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,420][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:11,420][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,420][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,420][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,420][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,421][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,421][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,421][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,421][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,421][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,421][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,421][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,421][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,422][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,422][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,422][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,422][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,422][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,422][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,422][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,423][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,423][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,423][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,423][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,423][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,423][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:11,423][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,423][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,424][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,424][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,424][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,424][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,424][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,424][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,424][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,425][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,425][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,425][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,425][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,425][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,425][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,425][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,425][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,426][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,426][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,426][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,426][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,426][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,426][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,426][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,427][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:11,427][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,427][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,427][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,427][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,427][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,427][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,427][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,428][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,428][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,428][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,428][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,428][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,428][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,428][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,429][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,429][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,429][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,429][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,429][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,429][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,429][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,429][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,430][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,430][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,430][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:11,430][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,430][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,430][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,430][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,431][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,431][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,431][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,431][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,431][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,431][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,431][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,431][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,432][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,432][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,432][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,432][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,432][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,432][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,432][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,433][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,433][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,433][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,433][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,433][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,433][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:11,433][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,433][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,434][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,434][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,434][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,434][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,434][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,434][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,434][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,435][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,435][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,435][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,435][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,435][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,435][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,435][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,436][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,436][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,436][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,436][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,436][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,436][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,436][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,436][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,437][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:11,437][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,437][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,437][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,437][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,437][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,437][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,438][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,438][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,438][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,438][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,438][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,438][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,438][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,438][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,439][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,439][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,439][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,439][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,439][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,439][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,439][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,440][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,440][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,440][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,440][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:11,440][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,440][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,440][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,440][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,441][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,441][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,441][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,441][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,441][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,441][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,442][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,442][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,442][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,442][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,442][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,442][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,442][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,442][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,443][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,443][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,443][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,443][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,443][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,443][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,443][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:11,444][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,444][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,444][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,444][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,444][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,444][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,444][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,444][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,445][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,445][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,445][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,445][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,445][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,445][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,445][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,446][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,446][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,446][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,446][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,446][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,446][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,446][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,446][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,447][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,447][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:11,447][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,447][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,447][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,447][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,447][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,448][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,448][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,448][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,448][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,448][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,448][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,448][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,448][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,449][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,449][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,449][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,449][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,449][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,449][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,449][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,450][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,450][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,450][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,450][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,450][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:11,450][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,450][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,450][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,451][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,451][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,451][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,451][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,451][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,451][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,451][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,452][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,452][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,452][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,452][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,452][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,452][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,452][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,452][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,453][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,453][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,453][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,453][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,453][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,453][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,453][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:11,453][asyncio][WARNING] - socket.send() raised exception. [previous message repeated verbatim through 12:05:11,671]
[2025-11-13 12:05:11,671][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,672][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,672][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,672][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,672][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,672][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,672][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,672][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,673][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,673][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,673][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,673][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,673][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,673][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,673][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,673][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,674][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,674][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,674][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,674][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,674][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,674][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,674][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,675][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,675][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:11,675][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,675][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,675][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,675][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,675][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,676][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,676][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,676][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,676][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,676][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,676][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,676][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,676][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,677][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,677][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,677][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,677][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,677][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,677][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,677][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,678][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,678][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,678][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,678][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,678][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:11,678][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,678][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,679][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,679][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,679][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,679][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,679][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,679][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,679][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,679][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,680][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,680][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,680][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,680][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,680][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,680][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,680][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,681][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,681][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,681][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,681][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,681][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,681][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,681][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,681][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:11,682][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,682][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,682][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,682][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,682][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,682][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,682][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,683][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,683][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,683][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,683][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,683][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,683][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,683][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,684][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,684][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,684][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,684][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,684][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,684][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,684][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,684][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,685][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,685][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,685][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:11,685][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,685][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,685][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,685][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,686][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,686][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,686][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,686][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,686][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,686][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,686][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,686][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,687][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,687][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,687][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,687][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,687][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,687][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,687][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,688][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,688][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,688][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,688][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,688][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,688][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:11,688][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,688][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,689][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,689][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,689][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,689][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,689][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,689][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,689][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,690][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,690][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,690][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,690][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,690][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,690][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,690][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,690][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,691][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,691][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,691][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,691][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,691][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,691][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,691][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,692][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:11,692][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,692][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,692][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,692][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,692][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,692][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,692][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,693][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,693][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,693][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,693][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,693][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,693][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,693][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,694][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,694][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,694][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,694][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,694][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,694][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,694][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,694][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,695][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,695][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,695][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:11,695][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,695][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,695][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,695][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,696][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,696][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,696][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,696][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,696][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,696][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,696][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,697][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,697][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,697][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,697][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,697][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,697][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,697][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,697][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,698][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,698][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,698][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,698][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,698][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,698][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:11,698][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,699][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,699][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,699][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,699][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,699][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,699][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,699][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,699][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,700][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,700][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,700][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,700][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,700][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,700][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,700][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,701][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,701][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,701][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,701][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,701][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,701][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,701][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,701][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,702][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:11,702][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,702][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,702][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,702][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,702][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,702][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,703][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,703][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,703][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,703][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,703][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,703][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,703][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,703][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,704][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,704][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,704][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,704][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,704][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,704][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,704][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,705][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,705][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,705][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,705][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:11,705][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,705][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,705][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,706][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,706][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,706][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,706][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,706][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,706][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,706][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,706][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,707][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,707][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,707][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,707][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,707][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,707][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,707][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,708][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,708][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,708][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,708][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,708][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,708][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,708][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:11,709][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,709][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,709][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,709][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,709][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,709][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,709][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,709][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,710][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,710][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,710][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,710][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,710][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,710][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,710][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,711][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,711][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,711][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,711][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,711][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,711][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,711][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,711][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,712][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,712][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:11,712][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,712][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,712][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,712][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,712][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,713][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,713][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,713][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,713][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,713][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,713][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,713][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,714][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,714][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,714][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,714][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,714][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,714][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,714][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,714][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,715][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,715][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,715][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,715][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,715][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:11,715][asyncio][WARNING] - socket.send() raised exception. [last message repeated ~525 times through 12:05:11,786]
[2025-11-13 12:05:11,786][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,786][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,786][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,786][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,787][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,787][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,787][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,787][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,787][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,787][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,787][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,787][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,788][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,788][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,788][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,788][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,788][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,788][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,788][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,789][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,789][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,789][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,789][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,789][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,789][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:11,789][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,789][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,790][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,790][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,790][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,790][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,790][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,790][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,790][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,791][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,791][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,791][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,791][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,791][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,791][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,791][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,791][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,792][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,792][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,792][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,792][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,792][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,792][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,792][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,793][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:11,793][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,793][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,793][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,793][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,793][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,793][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,793][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,794][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,794][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,794][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,794][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,794][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,794][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,794][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,794][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,795][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,795][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,795][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,795][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,795][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,795][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,795][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,796][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,796][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,796][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:11,796][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,796][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,796][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,796][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,796][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,797][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,797][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,797][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,797][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,797][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,797][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,797][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,798][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,798][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,798][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,798][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,798][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,798][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,798][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,798][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,799][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,799][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,799][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,799][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,799][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:11,799][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,799][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,800][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,800][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,800][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,800][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,800][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,800][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,800][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,800][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,801][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,801][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,801][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,801][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,801][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,801][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,801][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,802][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,802][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,802][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,802][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,802][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,802][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,802][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,802][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:11,803][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,803][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,803][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,803][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,803][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,803][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,803][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,804][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,804][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,804][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,804][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,804][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,804][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,804][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,804][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,805][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,805][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,805][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,805][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,805][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,805][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,805][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,806][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,806][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,806][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:11,806][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,806][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,806][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,806][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,806][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,807][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,807][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,807][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,807][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,807][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,807][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,807][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,808][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,808][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,808][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,808][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,808][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,808][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,808][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,808][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,809][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,809][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,809][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,809][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,809][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:11,809][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,809][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,809][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,810][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,810][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,810][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,810][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,810][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,810][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,810][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,811][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,811][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,811][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,811][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,811][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,811][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,811][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,812][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,812][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,812][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,812][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,812][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,812][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,812][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,812][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:11,813][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,813][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,813][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,813][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,813][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,813][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,813][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,814][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,814][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,814][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,814][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,814][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,814][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,814][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,814][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,815][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,815][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,815][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,815][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,815][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,815][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,815][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,815][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,816][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,816][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:11,816][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,816][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,816][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,816][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,816][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,817][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,817][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,817][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,817][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,817][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,817][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,817][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,817][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,818][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,818][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,818][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,818][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,818][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,818][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,818][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,819][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,819][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,819][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,819][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,819][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:11,819][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,819][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,819][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,820][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,820][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,820][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,820][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,820][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,820][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,820][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,821][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,821][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,821][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,821][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,821][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,821][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,821][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,821][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,822][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,822][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,822][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,822][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,822][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,822][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,822][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:11,823][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,823][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,823][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,823][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,823][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,823][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,823][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,823][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,824][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,824][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,824][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,824][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,824][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,824][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,824][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,825][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,825][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,825][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,825][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,825][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,825][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,825][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,825][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,826][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,826][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:11,826][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,826][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,826][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,826][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,826][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,827][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,827][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,827][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,827][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,827][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,827][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,827][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,827][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,828][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,828][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,828][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,828][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,828][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,828][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,828][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,828][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,829][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,829][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,829][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,829][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:11,829][asyncio][WARNING] - socket.send() raised exception.
[2025-11-13 12:05:11,899][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,899][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,899][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,900][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,900][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,900][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,900][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,900][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,900][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,900][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,900][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,901][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,901][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,901][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,901][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,901][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,901][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,901][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,902][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,902][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,902][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,902][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,902][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,902][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,902][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:11,903][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,903][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,903][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,903][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,903][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,903][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,903][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,903][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,904][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,904][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,904][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,904][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,904][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,904][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,904][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,905][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,905][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,905][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,905][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,905][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,905][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,905][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,905][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,906][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,906][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:11,906][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,906][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,906][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,906][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,906][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,907][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,907][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,907][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,907][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,907][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,907][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,907][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,907][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,908][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,908][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,908][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,908][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,908][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,908][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,908][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,909][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,909][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,909][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,909][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,909][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:11,909][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,909][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,909][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,910][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,910][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,910][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,910][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,910][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,910][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,910][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,911][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,911][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,911][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,911][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,911][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,911][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,911][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,911][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,912][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,912][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,912][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,912][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,912][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,912][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,912][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:11,913][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,913][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,913][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,913][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,913][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,913][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,913][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,913][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,914][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,914][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,914][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,914][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,914][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,914][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,914][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,915][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,915][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,915][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,915][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,915][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,915][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,915][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,915][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,916][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,916][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:11,916][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,916][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,916][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,916][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,916][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,917][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,917][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,917][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,917][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,917][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,917][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,917][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,917][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,918][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,918][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,918][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,918][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,918][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,918][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,918][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,919][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,919][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,919][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,919][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,919][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:11,919][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,919][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,919][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,920][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,920][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,920][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,920][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,920][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,920][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,920][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,921][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,921][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,921][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,921][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,921][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,921][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,921][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,921][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,922][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,922][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,922][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,922][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,922][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,922][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,922][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:11,922][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,923][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,923][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,923][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,923][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,923][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,923][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,923][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,924][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,924][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,924][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,924][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,924][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,924][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,924][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,925][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,925][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,925][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,925][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,925][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,925][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,925][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,925][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,926][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,926][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:11,926][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,926][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,926][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,926][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,926][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,927][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,927][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,927][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,927][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,927][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,927][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,927][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,927][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,928][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,928][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,928][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,928][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,928][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,928][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,928][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,929][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,929][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,929][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,929][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,929][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:11,929][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,929][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,929][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,930][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,930][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,930][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,930][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,930][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,930][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,930][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,931][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,931][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,931][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,931][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,931][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,931][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,931][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,931][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,932][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,932][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,932][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,932][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,932][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,932][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,932][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:11,933][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,933][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,933][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,933][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,933][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,933][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,933][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,933][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,934][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,934][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,934][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,934][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,934][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,934][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,934][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,935][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,935][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,935][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,935][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,935][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,935][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,935][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,935][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,936][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,936][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:11,936][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,936][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,936][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,936][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,936][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,937][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,937][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,937][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,937][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,937][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,937][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,937][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,937][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,938][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,938][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,938][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,938][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,938][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,938][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,938][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,939][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,939][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,939][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,939][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,939][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:11,939][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,939][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,939][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,940][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,940][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,940][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,940][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,940][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,940][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,940][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,940][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,941][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,941][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,941][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,941][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,941][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,941][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,941][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,942][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,942][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,942][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,942][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,942][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,942][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:11,942][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:11,942][asyncio][WARNING] - socket.send() raised exception.